REVIEW OF AVIATOR SELECTION


EXECUTIVE  SUMMARY 


Research  Requirement: 

In  June  2004,  the  U.S.  Army  Research  Institute  for  the  Behavioral  and  Social  Sciences 
(ARI)  was  tasked  with  conducting  the  research  and  development  towards  a  new  Selection 
Instrument for Flight Training (SIFT). The Army’s stated objectives were: 1. Develop a
computer-based  and  web-administered  selection  instrument  for  Army  flight  training  with 
emphasis  upon  aptitudes  for  Future  Force  aviator  performance  within  the  Future  Combat 
Systems  environment;  2.  Develop  an  aviator  selection  instrument  that  corrects  or  minimizes  risks 
associated  with  several  deficiencies  identified  in  the  current  selection  instrument  -  the  Alternate 
Flight  Aptitude  Selection  Test  (AFAST);  3.  Develop  the  selection  instrument  so  that  the  Army 
will  be  able  to  rapidly  assess  its  current  performance  as  a  predictor,  revise  the  instrument  when 
necessary  and  adapt  its  application  to  selection  for  related  occupational  categories  such  as 
Unmanned  Aerial  Vehicle  Operators  and  Special  Operations  Aviators;  and,  4.  Maximize 
utilization  (by  inclusion  or  adaptation)  of  existing  tests  as  may  be  found  in  use  or  under 
development  within  the  Department  of  Defense.  The  first  task  was  to  review  the  relevant 
selection  literature.  The  overall  goal  of  this  initial  task  was  to  collect  information  that  could  be 
used  to  produce  a  rational  recommendation  for  a  specific  selection  and  testing  strategy  for  Army 
aviation. 

Procedure: 

A  focused  review  of  aviator  selection  research,  supplemented  by  relevant  research  from 
the  general  personnel  selection  domain,  was  conducted.  The  review  identified  more  than  150 
potentially  relevant  articles.  Rather  than  rely  entirely  on  a  narrative  summary,  a  spreadsheet  was 
developed  to  summarize  information  about  various  test  batteries  and  to  facilitate  comparison  of 
the  test  batteries  when  deriving  a  recommended  selection  strategy.  From  this  analysis,  a 
selection  strategy  for  replacing  the  Army’s  current  aviator  selection  battery  was  recommended. 
The  results  of  this  review  also  informed  the  job  analysis  study  conducted  as  part  of  the  SIFT 
project. 

Findings: 

Research  clearly  suggests  that  cognitive  ability,  or  general  intelligence  (g),  will  be  an 
important  predictor  of  aviator  performance.  However,  there  is  reason  to  believe  that  measures  of 
the  following  constructs  may  add  incremental  validity  beyond  that  achieved  by  a  battery  that 
reliably  and  accurately  measures  general  intelligence:  psychomotor  skills;  selective  and  divided 
attention;  working  memory;  aviation  interest/knowledge;  flying  experience;  and,  personality. 

The  recommended  selection  strategy  is  a  two-stage  testing  process.  The  first  stage  of  testing  will 
measure  cognitive  and  personality/motivational  traits  important  for  the  aviator  job.  These  tests 
do not require any non-standard computer peripherals and can be administered via the Internet in
virtually  any  location  with  access  to  a  desktop  computer,  the  Internet,  and  a  test  proctor.  The 
second  stage  of  the  test  battery  will  include  performance-based  measures  of  psychomotor  and 
information  processing  skills.  These  tests  do  require  non-standard  computer  peripherals  and  may 
better  serve  the  needs  of  Army  aviation  as  classification  instruments,  for  tracking  selected 
aviators  into  one  of  the  four  mission  platforms.  Both  the  U.S.  Navy  and  the  U.S.  Air  Force 
currently  use  an  aviator  selection  test  battery  that  measures  cognitive  abilities  important  for  U.S. 
Army  Aviators,  and  one  of  these  two  batteries  should  be  adopted  for  Army  aviator  selection.  The 
U.S.  Army  also  possesses  two  non-cognitive  inventories  that  can  be  adapted  for  use  with  the 
Army  aviator  applicant  population.  Finally,  it  is  recommended  that  a  small  number  of  new 
ability  tests  and  non-cognitive  scales  be  developed  to  measure  abilities  or  traits  that  are  not 
currently  measured  by  any  of  the  readily-accessible  test  batteries  or  non-cognitive  instruments. 

Utilization  and  Dissemination  of  Findings: 

This  product  is  one  of  many  emanating  from  the  SIFT  effort.  The  contents  of  this  report 
flow mainly into decision processes conducted internally to the project, but also document the
overall  conduct  of  the  effort.  Documentation  of  the  development  of  this  selection  instrument  is 
necessary  to  provide  a  basis  to  defend  the  scientific  and  theoretical  underpinnings  of  the  test  and 
to  provide  a  detailed  base  from  which  revisions  can  be  made  in  time.  This  report  provides 
information  for  use  in  transition  of  the  selection  instrument  into  operation. 
REVIEW OF AVIATOR SELECTION


Introduction 

In  June  2004,  the  US  Army  Research  Institute  for  the  Behavioral  and  Social  Sciences 
(ARI)  awarded  the  Selection  Instrument  for  Army  Flight  Training  (SIFT)  contract  to  Personnel 
Decisions  Research  Institutes  (PDRI).  The  Army’s  stated  objectives  were:  1)  Develop  a 
computer-based  and  web-administered  selection  instrument  for  Army  flight  training  with 
emphasis  upon  aptitudes  for  Future  Force  aviator  performance  within  the  Future  Combat 
Systems  environment;  2)  Develop  an  aviator  selection  instrument  that  corrects  or  minimizes  risks 
associated  with  several  deficiencies  identified  in  the  current  selection  instrument  -  the  Alternate 
Flight  Aptitude  Selection  Test  (AFAST);  3)  Develop  the  selection  instrument  so  that  the  Army 
will  be  able  to  rapidly  assess  its  current  performance  as  a  predictor,  revise  the  instrument  when 
necessary  and  adapt  its  application  to  selection  for  related  occupational  categories  such  as 
Unmanned  Aerial  Vehicle  Operators  and  Special  Operations  Aviators;  and,  4)  Maximize 
utilization  (by  inclusion  or  adaptation)  of  existing  tests  as  may  be  found  in  use  or  under 
development  within  the  Department  of  Defense. 

The  project  was  divided  into  several  tasks.  This  report  summarizes  efforts  conducted  in 
relation to Task 1: Review the existing Army aviation accession process and relevant literature.
The  overall  goal  of  Task  1  was  to  collect  information  that  could  be  used  to  produce  a  rational 
decision  on  a  specific  selection  and  testing  strategy. 

Overview  of  Existing  Army  Aviation  Accession  Procedures 

A  review  of  existing  Army  aviation  accession  procedures  was  conducted  to  provide  the 
context  for  recommending  a  replacement  for  the  AFAST.  This  included  reviewing  Army 
regulations  and  other  documents.  US  Army  aviators  are  Commissioned  or  Warrant  Officers. 
Commissioned  Officers  primarily  come  from  a  military  academy,  or  from  a  Reserve  Officer 
Training  Corps  (ROTC)  or  Officer  Candidate  School  (OCS)  program.  Civilians  and  enlisted 
personnel  from  any  branch  of  the  US  military  may  apply  to  become  an  Army  Aviation  Warrant 
Officer.  Prior  to  volunteering  for  aviation  duty,  candidates  must  meet  standards  for  becoming  a 
Commissioned  Officer  or  a  Warrant  Officer  in  the  US  Army.  Among  other  things,  this  includes 
meeting  physical  and  medical  standards,  and  earning  a  qualifying  score  on  the  relevant 
admission  exam  (Scholastic  Aptitude  Test  or  the  American  College  Test  for  Commissioned 
Officers;  Armed  Services  Vocational  Aptitude  Battery  (ASVAB)  General-Technical  (GT) 
Composite  for  Warrant  Officers). 

Candidates  who  apply  to  become  an  Army  aviator  must  meet  additional  standards  beyond 
those  described  above.  The  selection  process  is  rigorous  and  there  are  typically  five  to  ten 
applicants  for  every  available  training  seat.  Selection  standards  are  highly  similar  across  all 
accession  sources  but  the  exact  procedures  vary  to  some  degree,  depending  on  whether  the 
applicant  is  a  Commissioned  versus  a  Warrant  Officer,  the  source  from  which  he/she  comes 
(e.g.,  US  Army  versus  US  Army  National  Guard  or  Reserve),  and  whether  or  not  the  applicant  is 
already  on  active  duty  at  the  time  of  application.  In  general,  all  Army  Aviator  applicants  must 
meet  physical  fitness  and  medical  standards  beyond  those  required  to  become  a  Commissioned 
or  Warrant  Officer,  meet  minimum  and  maximum  age  requirements,  earn  a  qualifying  score  on 
the AFAST, and be recommended by a selection board. Flight experience and post high-school
coursework  or  degree  are  preferred  but  not  required. 

The  following  Army  regulations  (AR)  and  other  documents  outline  selection  and  testing 
requirements: 

•  Selection  and  Training  of  Army  Aviation  Officers  (AR  611-110, 14  Nov  2003) 

• Aviation Warrant Officer Training (AR 611-85, 15 June 1981)

•  Army  Personnel  Selection  and  Classification  Testing  (AR  611-5, 10  June  2002) 

•  Appointment  of  Commissioned  and  Warrant  Officers  of  the  Army  (AR  135-100, 1 
Sept  1994) 

•  Warrant  Officer  Procurement  Program  (Department  of  the  Army  Circular  601-99-1,  23 
April  1999) 

• Warrant Officer Professional Development (Department of the Army Pamphlet 600-11,
30 Dec 1996)

• Order to Active Duties as Individuals Other than a Presidential Selected Reserve Call-up,
Partial or Full Mobilization (AR 135-210, 17 Sept 1999)

•  Policies  and  Procedures  for  Active-Duty  List  Officer  Selection  Boards  (Department  of 
the Army Memo 600-2, 24 Sept 1999)

After  candidates  are  selected  as  Army  aviators,  they  report  to  Ft.  Rucker,  AL  for  training. 
All candidates complete an 18-week Initial Entry Rotary Wing (IERW) core training program
and  a  two-week  Basic  Navigation  course,  followed  by  12  to  20  weeks  of  training  in  a  specific 
operational  aircraft.  Student  aviators  are  assigned,  or  “classified”  into  one  of  four  tracks  for 
aircraft-specific  training:  Scout,  Attack,  Cargo,  or  Utility.  Classification  decisions  are  currently 
based  in  part  on  academic  grades  in  IERW  and  in  part  on  the  needs  of  the  Army.  Upon 
completion  of  aircraft-specific  training,  aviators  are  assigned  to  a  Military  Occupational 
Specialty  (MOS)  that  corresponds  to  the  type  of  aircraft  they  are  qualified  to  fly,  and  they  begin 
their  first  operational  tour  as  an  Army  Aviator. 

Brief  History  of  Aviator  Selection 

The  prediction  of  aviator  performance  played  a  prominent  role  in  the  military  research 
and  development  arena  for  most  of  the  last  century.  In  a  review  of  aviator  selection  research, 
Hunter  (1989)  explained  that  this  continued  emphasis  is  a  result  of  the  expense  involved  in 
aviator  training,  noting  that,  almost  without  exception,  aviator  training  is  the  most  expensive  of 
the  training  programs  conducted  by  the  military  services.  The  US  Navy  estimates  that  the  sunk 
costs  for  student  aviators  who  fail  training  range  from  $500,000  to  $1,000,000,  depending  on  the 
stage  at  which  failure  occurs  (Helm  &  Reid,  2003).  According  to  Carretta  and  Ree  (2000), 
estimates  of  the  cost  of  each  person  who  failed  to  complete  US  Air  Force  (USAF)  undergraduate 
aviator  training  range  from  $50,000  (Hunter,  1989)  to  $80,000  (Siem,  Carretta,  &  Mercatante, 
1988).  The  amount  approaches  $500,000  per  candidate  by  the  end  of  flight  school  for  US  Army 
aviators. 
Since World War I, the military services have explored the relationships between
measures  of  a  wide  variety  of  personal  characteristics  and  aviator  performance.  As  early  as 
World  War  I,  tests  of  mental  alertness  and  emotional  stability  were  found  to  be  predictive  of 
aviator  success  (North  &  Griffin,  1977).  Between  World  War  I  and  World  War  II,  measures  of 
psychomotor  coordination  received  the  primary  emphasis  in  aviator  selection  research.  A  flurry 
of  developmental  activity  produced  “aircraft-like  controls”  for  use  in  measuring  complex 
coordination,  two-hand  coordination,  rudder  control  skills,  dual-task  performance,  and  the  like. 

A  number  of  these  psychomotor  tests,  especially  those  of  a  more  complex  nature,  were  found  to 
be  valid  for  aviator  selection.  However,  in  the  early  1950s  psychomotor  tests  were  largely 
abandoned  as  a  result  of  persistent  problems  with  reliability  and  maintainability  of  these 
electromechanical  devices  (Hunter,  1989). 

With  the  advent  of  World  War  II,  research  on  aviator  selection  and  classification 
expanded  to  include  measurement  of  additional  abilities  such  as  spatial  orientation  and  the  use  of 
new  testing  tools  (e.g.,  motion  pictures,  photographs).  Much  of  what  is  known  today  about 
spatial  and  psychomotor  abilities,  as  well  as  several  other  related  attributes,  stems  from  the 
classic  Army  Air  Force  (AAF)  work  (Guilford  &  Lacey,  1947;  Melton,  1947)  and  the  Navy’s 
Pensacola  1000  Aviator  Study  (Franzen  &  McFarland,  1945).  After  the  war,  Fleishman  and  his 
colleagues  continued  psychomotor  abilities  research  (e.g.,  Fleishman,  1967,  1972;  Fleishman  & 
Hempel,  1954).  Researchers  also  investigated  personality  characteristics  related  to  attrition  from 
aviator  training  and/or  aviator  performance  (Griffin  &  Mosko,  1977). 

Within  the  last  few  decades,  innovations  in  aviator  selection  and  classification  have 
centered  on  attributes  such  as  multi-task  performance  (e.g.,  Griffin  &  McBride,  1986),  division 
of  attention  (e.g.,  Carretta,  1987d),  decision  making  speed  (e.g.,  Carretta,  1988),  and  attitudinal 
and  motivational  traits  (Foushee  &  Helmreich,  1986;  Helmreich,  Foushee,  Benson  &  Russini, 
1986).  Personality  also  received  a  good  deal  of  attention  in  the  past  two  decades.  Much  of  the 
early  work  was  exploratory  in  nature,  attempting  to  determine  which  personality  traits  were 
related  to  various  outcomes  relevant  for  aviators,  but  not  necessarily  guided  by  any  particular 
theory  of  personality  or  aviator  performance.  For  example,  several  researchers  administered 
personality  inventories  that  had  been  well  established  as  useful  for  purposes  other  than  aviator 
selection,  including  the  Minnesota  Multiphasic  Personality  Inventory  (Caldwell,  O’Hara, 
Caldwell,  Stephens,  &  Krueger,  1993),  Eysenck  Personality  Inventory  (Bartram  &  Dale,  1982; 
Jessup  &  Jessup,  1971),  and  the  Edwards  Personal  Preference  Schedule  (Fry  &  Reinhardt,  1969). 
Other  researchers  developed  their  own  inventory,  for  example,  the  programmatic  research 
conducted  by  the  USAF  that  eventually  led  to  the  NEO-PI  and  the  Self-Description  Inventory 
(Christal,  1975;  Christal,  Barucky,  Driskill,  &  Collis,  1997;  Tupes  &  Christal,  1961). 

Some  of  the  research  specifically  focused  on  developing  personality  profiles  for 
helicopter  aviators  (Caldwell,  et  al.,  1993;  Geist  &  Boyd,  1980;  Harrs,  Kastner,  &  Beerman, 
1991;  Howse,  1995).  Another  arena  of  increasing  importance  is  selection  of  individuals  to  fly 
unmanned  aerial  vehicles  (UAVs).  For  example,  US  Navy  researchers  have  examined  the 
validity  of  a  test  battery  designed  to  measure  psychomotor,  multi-tasking,  and  visuospatial 
abilities  in  a  small  sample  of  UAV  operators,  with  promising  results  (Phillips,  Arnold,  & 
Fatolitis,  2003). 
This report describes the specific procedures, findings, and implications of a focused
review  of  the  aviator  selection  literature.  As  an  initial  step  in  the  development  of  SIFT,  the  goal 
of  this  review  was  to  produce  a  rational  recommendation  for  a  specific  selection  and  testing 
strategy  for  Army  aviation.  Therefore,  consideration  was  given  to  methodological  limitations 
and  obstacles  in  conducting  selection  research,  as  well  as  to  the  incremental  validities  and 
practical  issues  associated  with  the  tests  being  studied. 

Focused  Literature  Review 

As  noted  above,  aviator  selection  and  classification  research  has  been  conducted  since  the 
1920s and a tremendous amount has been written on this subject. This focused literature review
was  designed  to  provide  a  research-based  foundation  for  a  recommended  selection  strategy. 
Therefore,  no  attempt  was  made  to  review  every  aviator  selection  study  that  has  ever  been 
conducted.  Rather,  the  focus  was  on  key  studies  related  to  currently  or  recently  available 
selection  batteries,  particularly  those  studies  conducted  by  the  US  military. 

The  specific  goals  for  conducting  this  literature  review  were  to: 

1. Review studies that delineate the knowledge, skill, ability, and other characteristics
(KSAOs)  important  for  performing  the  aviator  job,  with  particular  emphasis  on  studies 
that  involve  helicopter  aviators.  This  information  would  help  inform  the  job  analysis 
phase  of  the  project. 

2.  Review  studies  that  focus  on  aviator  selection  batteries  currently  (or  recently)  in  use 
by  the  US  Air  Force,  US  Navy,  and  other  relevant  organizations  (e.g.,  foreign  military, 
commercial  airlines). 

Literature  Review  Methodology 

The  first  step  in  this  task  was  to  identify  currently  or  recently  available  test  batteries  that 
might  be  viable  candidates  for  consideration  as  a  replacement  for  the  AFAST  and,  once  those 
were identified, to locate and summarize key research about them. This step required
consideration of a wide range of possible tests or test batteries, with the expectation that, at a
later date, a number of potential candidates would be ruled out with relative ease (e.g., test
batteries that cannot be computerized or ones that involve prohibitively expensive licensing
fees). The review could lead either to recommending one or more existing test batteries as
intact entities, with minimal changes, or to recommending specific subtests from a variety of
existing batteries.

Seven  on-line  databases  were  searched  first,  to  obtain  pertinent  literature.  These  included 
PsycINFO, Defense Technical Information Center (DTIC), the Air Force Research Laboratory
Research  Archive  Library,  the  Civil  Aeromedical  Institute  database  of  technical  reports,  the 
Naval  Medical  Research  Laboratory  database  of  technical  reports,  and  the  archives  of  the  Human 
Factors  and  Ergonomics  Society  (HFES).  The  HFES  database  covers  all  of  the  Society’s 
publications, including the Society’s bulletin and magazine. The seventh database searched was the
United States Air Force Human Resources Laboratory (AFHRL, 1968-1998) Topics, hosted by
the  Innovation  Center  for  Occupational  Data,  Applications,  and  Practices.  All  of  these  databases 
were  searched  using  terms  such  as  “aviator  selection,”  “ab  initio”  (from  the  beginning), 
“personality,”  and  “psychomotor.”  Personnel  Decisions  Research  Institutes  (PDRI)  also 
searched its archives for articles and technical reports related to aviator selection, based on prior
work  with  the  US  Air  Force,  particularly  in  the  area  of  Crew  Resource  Management  (CRM). 

In  addition,  the  Damos  Aviation  Services  (DAS)  database  was  searched,  which  consists 
primarily  of  articles  related  to  aviator  selection  and  performance.  This  database  currently  has 
over  3800  entries.  The  earliest  entry  pertaining  to  aviator  selection  in  the  DAS  library  dates 
from  1921.  It  contains  references  to  both  civilian  and  military  aviator  selection,  and  a  substantial 
proportion  of  the  entries  are  concerned  with  foreign  aviator  selection.  The  DAS  database  covers 
all  of  the  International  Journal  of Aviation  Psychology,  all  of  the  Proceedings  of  the 
International  Symposium  on  Aviation  Psychology,  and  the  last  19  years  of  Aviation,  Space,  and 
Environmental  Medicine.  Any  recent  materials  that  had  not  yet  been  entered  into  the  database 
were  searched  by  hand.  Hand  searches  also  were  conducted  on  recently  edited  books  that  had 
not  yet  been  entered  into  the  database.  Several  individuals  involved  with  aviator  selection  were 
also  contacted  to  obtain  updates  on  their  current  aviator  selection  research  projects. 

Findings  from  Aviator  Selection  Research  Literature 

Most  of  the  research  on  aviator  selection  has  been  conducted  by  the  military  in  the  United 
States,  the  United  Kingdom,  and  Norway.  Some  research  was  also  published  by  military 
organizations  in  other  countries  (e.g.,  Israel,  Turkey)  and  in  the  commercial  sector.  The  Federal 
Aviation  Administration  (FAA)  and  National  Aeronautics  and  Space  Administration  (NASA) 
have  both  conducted  research  in  the  arenas  of  cognitive  and  non-cognitive  testing.  Of  most 
relevance  for  the  present  research  is  work  conducted  by  NASA  in  the  area  of  personality  traits 
impacting  aircrew  performance  (e.g.,  Helmreich,  Foushee,  Benson,  &  Russini,  1986;  Musson, 
Sandal,  &  Helmreich,  2004)  and  work  originated  by  the  FAA’s  Civil  Aeromedical  Institute 
(CAMI)  on  a  test  battery  called  CogScreen  (King  &  Flynn,  1995).  The  following  sections 
summarize  key  research  found  in  the  aviator  selection  research  literature,  as  well  as  in  the 
general  selection  research  literature. 


General  aviator  selection  reviews.  A  number  of  reviews  of  the  aviator  selection  literature 
have been published (Carretta & Ree, 2000, 2003; Dolgin & Gibb, 1988; Griffin & Koonce,
1996; Hunter, 1989; North & Griffin, 1977; Ree & Carretta, 1996, 1998; Rogers, Roach, &
Short, 1986; Tirre, 1997; Turnbull, 1992), including one that focuses specifically on
methodological difficulties and common shortfalls associated with such research (Damos, 1996).
In  their  review  of  aviator  selection  methods,  Carretta  &  Ree  (2000)  state,  “Research  results  point 
to  g  [general  intelligence]  as  the  most  important  underlying  construct  in  the  prediction  of  aviator 
success.  Clearly,  three  others  have  been  shown  to  be  important  but  to  a  smaller  degree:  flying 
job  knowledge,  personality,  and  general  psychomotor  ability”  (p.  31).  These  authors  note  that, 
“Simulation-based  tests  may  significantly  increment  the  validity  of  cognitive  tests  when  the  two 
approaches are used together. These results are consistent with a large-scale meta-analysis of 19
commonly  used  personnel  selection  methods  across  many  occupations  (Schmidt  &  Hunter, 
1998)”  (p.  24).  Regarding  personality  measures,  Carretta  and  Ree  comment  that  a  great  deal  of 
research  has  been  conducted  in  this  area,  with  contradictory  results.  They  go  on  to  say  that 
organizing  the  results  according  to  the  Big  Five  personality  variables  of  Neuroticism, 
Extraversion,  Openness,  Agreeableness,  and  Conscientiousness  (Norman,  1963;  Tupes  & 
Christal, 1961) would likely be enlightening, but has not (yet) been done in the aviator selection
arena. 
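
The notion of incremental validity discussed above can be illustrated with a short computation. The following Python sketch is not drawn from the studies reviewed; it applies the standard formula for the multiple correlation of a criterion with two predictors. The function names and all correlation values are hypothetical and were chosen only for illustration.

```python
import math


def multiple_r(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r_y1 and r_y2 are the zero-order validities of predictors 1 and 2;
    r_12 is the correlation between the two predictors.
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)


def incremental_validity(r_y1: float, r_y2: float, r_12: float) -> float:
    """Gain in multiple R from adding predictor 2 to predictor 1 alone."""
    return multiple_r(r_y1, r_y2, r_12) - abs(r_y1)


# Hypothetical values: a cognitive battery with validity .40, a
# simulation-based test with validity .30, intercorrelated at .30.
gain = incremental_validity(0.40, 0.30, 0.30)
print(f"Multiple R: {multiple_r(0.40, 0.30, 0.30):.3f}, gain: {gain:.3f}")
```

With these illustrative inputs, the second predictor raises the multiple correlation from .40 to about .44. The gain falls to zero when the second predictor is related to the criterion only through its overlap with the first (that is, when r_y2 equals r_y1 times r_12).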


Griffin  and  Koonce  (1996)  also  wrote  a  comprehensive  review  of  aviator  selection,  with 
particular  emphasis  on  measures  of  psychomotor  skills.  They  review  numerous  research  studies 
showing  that  several  types  of  predictor  measures  are  valid  for  predicting  aviator  performance, 
including: 

•  aptitude  (cognitive  ability); 

•  psychomotor  skills; 

•  work  simulation; 

•  divided  attention  (or  multiple-task  performance); 

•  flying  experience;  and, 

•  biographical  information. 

According  to  these  authors,  uncorrected,  zero-order  correlations  for  psychomotor  skills 
are  in  the  .30  to  .40  range  and  multiple  regression  correlations  are  in  the  .50  range  in  research 
studies  involving  continuous  criterion  measures  such  as  instructor  check/flight  ride  ratings.  With 
regard to measures of psychomotor skills, the authors concluded:

Automated  versions  of  vintage  psychomotor  tests  (developed  in  the  1930s  and  1940s) 
seem  to  be  as  predictive  of  military  aviator/aviator  performance  today  as  in  the  past.  The 
use  of  computers  may  have  enhanced  the  predictive  power  of  the  psychomotor  tests  by 
making  their  functioning  dependent  on  digital  electronic  circuitry,  rather  than  analog 
electromechanical  devices,  resulting  in  more  reliable  performance  measurement.  The 
psychomotor  tests  receiving  the  most  attention  today  are  the  CCT  [complex  coordination 
test] and the THCT [two-hand coordination test], originally developed by Mashburn and
colleagues before World War II (Mashburn, 1934). These tests were significant
predictors  of  USAF  and  Navy  pass-fail  criteria  in  the  past,  and  automated  versions  are 
predictive  today.  However,  the  tests  are  better  predictors  of  normally  distributed, 
continuous  criteria  such  as  flight  grades  and  number  of  flight  hours  for  the  Navy  and 
check  rides  and  advanced  training  ratings  for  the  USAF  [than  of  traditional  pass-fail]  (p. 
143). 

Tirre (1997) made a useful distinction between two different approaches to aviator
selection: 

•  Basic  attributes  -  In  this  approach,  the  test  battery  measures  specific  attributes  that 
are  assumed  to  underlie  aviator  performance.  Examples  of  this  approach  include  the 
USAF’s Air Force Officer Qualifying Test (AFOQT) and Basic Attributes Test
(BAT; Carretta, 1987a).

•  Learning  sample  (simulation)  -  In  this  approach,  the  test  battery  simulates  tasks 
performed  in  flight,  with  varying  degrees  of  realism.  An  example  is  the  Canadian 
Automated  Pilot  Selection  System  (CAPSS). 
Each approach has advantages and disadvantages, many of which are outlined by Tirre.
The  basic  attributes  approach  has  a  long  history  outside  aviator  selection  and  is  generally  less 
costly  and  time-consuming  to  develop  and  administer  than  the  learning  sample  approach.  In  fact, 
it  has  only  been  possible  to  use  the  learning  sample  approach  widely  and  effectively  with  the 
advent  of  powerful  desktop  computers.  The  learning  sample  approach  offers  the  advantage  of 
dynamic  (as  opposed  to  static)  measurement  of  cognitive  processing  skills  and  often  involves 
measures  that  appear  very  realistic  to  test-takers.  With  either  approach,  the  reliability  and 
validity of the measurement tool depend critically on how carefully it was developed.


Obstacles  and  Issues  in  Conducting  Aviator  Selection  Research 

There  are  several  obstacles  to  conducting  research  studies  in  the  aviator  selection  domain, 
many  of  which  have  been  recognized  for  a  long  time  and  many  of  which  are  exceedingly 
difficult  to  overcome.  These  issues  have  been  described  in  several  of  the  preceding  reviews 
(e.g.,  Carretta  &  Ree,  2000;  Damos,  1996),  and  the  most  important  ones  are  summarized  below. 
When  reviewing  the  literature,  it  became  clear  that  some  researchers  recognized  these  obstacles 
and  acknowledged  how  their  study  results  and  conclusions  were  likely  impacted;  many  others 
did  not. 

Training  Performance  as  a  Criterion  Measure 

The  criterion  measure  in  aviator  selection  research  studies  is  almost  always  a  measure  of 
training  performance.  While  training  performance  is  clearly  an  important  outcome  measure,  it 
certainly  is  not  the  only  outcome  variable  of  interest.  Unfortunately,  it  is  exceedingly  difficult  to 
obtain  reliable  and  accurate  measures  of  aviator  performance  after  training.  The  reliance  on 
training  performance  as  a  criterion  measure  is  particularly  problematic  because  researchers  are 
typically  unable  to  differentiate  various  types  of  “failure.”  Different  abilities  or  traits  may 
underlie  different  types  of  failure,  but  the  pattern  of  relationships  will  be  difficult  or  impossible 
to  detect  if  there  is  no  way  to  identify  and  code  the  reason(s)  for  failure. 

The  reliance  on  training  outcome  measures  is  also  problematic  when  attempting  to 
evaluate  the  validity  of  predictor  measures  that  would  not  necessarily  be  expected  to  predict 
training  performance  (e.g.,  personality  measures).  Research  conducted  as  part  of  the  US  Army’s 
Project  A  shows  that  measures  of  cognitive  ability  predict  declarative  knowledge  and  technical 
components of performance (McCloy, Campbell, & Cudeck, 1994) while measures of
non-cognitive characteristics predict motivational aspects of job performance (Campbell, Hanson, &
Oppler,  2001;  McCloy,  Campbell,  &  Cudeck,  1994)  and  contextual  performance  (Borman, 
Penner,  Allen,  &  Motowidlo,  2001;  Campbell,  Harris,  &  Knapp,  2001;  Campbell  &  Knapp, 
2001).  While  motivational  factors  certainly  play  a  role  in  training  performance  and  most 
students are highly motivated to succeed, the types of training criterion measures typically used in
aviator selection research do not separate technical performance and motivational aspects of
performance. Thus, training criterion measures are likely weighted more heavily toward
academic and technical aspects of performance (e.g., flight instructor ratings, grades, pass-fail
status) and less heavily toward motivational aspects of performance.
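
Pass-fail status, one of the training criteria mentioned above, also limits the size of the validity coefficient that can be observed. The following Python sketch is an illustration rather than an analysis from the reviewed studies. It assumes that pass-fail status results from cutting a normally distributed underlying performance variable at a fixed point, and that the predictor is normally distributed. Under those assumptions it computes the largest point-biserial correlation obtainable at a given pass rate, along with the standard biserial adjustment. The function names and pass rates are hypothetical.

```python
import math
from statistics import NormalDist


def max_point_biserial(pass_rate: float) -> float:
    """Ceiling on the point-biserial correlation for a given pass rate.

    Assumes a normal predictor that is perfectly related to the normal
    variable underlying the pass-fail split (an idealized best case).
    """
    if not 0.0 < pass_rate < 1.0:
        raise ValueError("pass_rate must lie strictly between 0 and 1")
    std_normal = NormalDist()
    cut = std_normal.inv_cdf(pass_rate)      # z-score of the cut point
    ordinate = std_normal.pdf(cut)           # normal density at the cut
    return ordinate / math.sqrt(pass_rate * (1.0 - pass_rate))


def biserial_from_point_biserial(r_pb: float, pass_rate: float) -> float:
    """Estimate the correlation with the underlying continuous criterion."""
    return r_pb / max_point_biserial(pass_rate)


# Hypothetical pass rates, for illustration only.
for rate in (0.50, 0.75, 0.90):
    print(f"pass rate {rate:.2f}: ceiling {max_point_biserial(rate):.3f}")
```

Under these assumptions the ceiling is about .80 for a 50-50 split and about .58 when 90 percent of students pass, so the same underlying relationship yields a smaller observed coefficient as the pass rate becomes more extreme.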
Statistical/Methodological Issues

Aviator  selection  research  is  plagued  by  a  number  of  statistical  and  methodological 
issues.  Some  of  them  are  extremely  difficult,  if  not  impossible,  to  overcome. 

1. Using predictor or criterion measures of low or unknown reliability. In many cases,
the  reliability  of  predictor  and  criterion  measures  is  not  reported  and  may  be  quite 
low,  particularly  in  the  case  of  criterion  measures.  Thus,  the  impact  of  unreliable 
measurement  on  the  outcomes  of  the  study  cannot  be  evaluated. 

2.  The  most  common  criterion  measure  is  a  dichotomous  variable  -  pass-fail  status  at 
the  end  of  training.  When  working  with  dichotomous  criterion  variables,  the  highest 
possible  value  of  a  correlation  between  any  predictor  measure  and  that  criterion 
variable  depends  on  the  distribution  of  the  dichotomous  variable.  The  maximum 
possible  value  of  the  correlation  is  lower  the  more  the  distribution  varies  from  a  50-50 
split.  For  aviator  training  pass-fail  status,  the  pass-fail  distribution  is  usually  much 
more  extreme  than  50-50.  It  is  possible  to  correct  the  correlation  coefficient  for 
dichotomization,  and  some  researchers  did  this.  It  is  important  to  note  that,  while 
pass-fail  performance  in  training  is  impacted  by  the  attitudes  and  skills  of  student 
aviators,  it  is  also  impacted  by  the  policies  of  aviator  accession  and  training 
organizations.  When  there  is  a  strong  need  for  aviators,  for  example  during  war, 
there  is  strong  pressure  to  ensure  that  virtually  all  students  will  pass  training.  In 
addition,  most  aviator  training  programs  make  every  effort  to  ensure  that  most 
students  pass  training  because  it  is  very  costly  to  fail  a  candidate  after  several  weeks 
of  expensive  training. 

3.  Aviator  selection  research  is  based  on  a  highly  selected  and  homogeneous 
population.  Before  they  begin  an  aviator  training  program,  all  applicants  have  been 
extensively  screened,  including  meeting  a  required  minimum  score  to  enter  the 
military,  meeting  a  required  minimum  score  on  an  aviator  aptitude  battery,  meeting 
education  requirements,  and/or  earning  strong,  positive  evaluations  and 
recommendations  from  a  superior  officer  or  a  selection  board.  The  samples  used  in 
most  aviator  selection  research  are  also  typically  highly  homogeneous  in  terms  of 
race  and  gender.  Screening  occurs  in  multiple  stages,  with  each  stage  serving  to 
further  restrict  the  sample  relative  to  the  general  population.  Correlations  can  be 
corrected  for  some  types  of  range  restriction,  but  there  is  disagreement  about  the 
extent  to  which  such  corrections  should  be  made.  Damos  (1996)  argues  that  it  is  not 
appropriate  to  make  such  corrections  because  aviators  will  never  be  selected  from  an 
unrestricted  sample.  In  addition,  some  types  of  restriction  cannot  be  corrected  for, 
including  the  demographic  composition  of  the  sample. 

4.  Failure  to  correct  for  capitalization  on  chance.  A  number  of  researchers  in  the 
aviator  selection  domain  have  used  regression  techniques  to  evaluate  the  validity  of  a 
test  battery,  without  recognizing  or  correcting  for  the  fact  that  such  techniques 
capitalize  on  chance  variations  present  in  their  sample.  The  reported  multiple 
correlation  may  not  generalize  to  a  new  sample,  especially  if  the  original  sample  was 
not  very  large. 
5. Small sample sizes. In some aviator selection research, the sample is very small.
This means there may have been very little power to detect significant relationships
even if they did exist.

6.  Measurement  method  is  confounded  with  measurement  target.  As  Carretta  and  Ree 
(2000),  Hough  (2001),  and  others  have  noted,  in  some  research  studies,  measurement 
method  (e.g.,  biodata  or  personality  inventory)  is  confounded  with  the  measurement 
target  (i.e.,  KSAOs).  In  some  cases,  there  is  a  close  correspondence  between 
measurement  method  and  measurement  target.  For  example,  “psychomotor  tests” 
virtually  always  measure  one  or  more  psychomotor  abilities,  and  typically  very  little 
else.  In  contrast,  the  “biodata”  or  “personality”  measurement  method  can  be  used  to 
target  leadership  tendencies,  conscientiousness,  stress  tolerance,  psychopathology, 
motivation,  or  other  KSAOs.  Summarizing  findings  across  all  biodata  inventories  or 
all  personality  inventories  tells  us  little  about  which  underlying  traits  are  more  and 
less  predictive  of  aviator  performance.  The  situation  is  worsened  by  the  fact  that  not 
all  biodata  and  personality  inventories  measure  the  same  targets.  Thus,  across 
studies,  there  may  be  a  great  deal  of  variation  in  the  extent  to  which  relevant  and 
irrelevant  KSAOs  are  measured. 
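
Two of the corrections mentioned in the list above can be sketched in a few lines. The example below is illustrative only. It assumes direct range restriction on the predictor with a known ratio of unrestricted to restricted standard deviations (the correction usually called Thorndike's Case II), and it uses the common adjusted R-squared shrinkage formula often attributed to Wherry. The function names and numeric inputs are hypothetical, and the studies reviewed here did not necessarily use these particular formulas.

```python
from math import sqrt


def correct_range_restriction(r_restricted: float, sd_ratio: float) -> float:
    """Estimate the unrestricted correlation (Thorndike Case II).

    sd_ratio is SD(unrestricted) / SD(restricted) on the predictor. It is an
    input the researcher must supply; a value of 1.0 means no restriction.
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    r, u = r_restricted, sd_ratio
    return (r * u) / sqrt(1.0 - r * r + r * r * u * u)


def shrunken_multiple_r(multiple_r: float, n: int, k: int) -> float:
    """Adjust a sample multiple R for capitalization on chance.

    Uses adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1), where n is the
    sample size and k is the number of predictors. Negative values are set to 0.
    """
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    adjusted_r2 = 1.0 - (1.0 - multiple_r ** 2) * (n - 1) / (n - k - 1)
    return sqrt(max(adjusted_r2, 0.0))


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    print(round(correct_range_restriction(0.20, 1.5), 3))
    print(round(shrunken_multiple_r(0.40, n=60, k=8), 3))
```

With these invented inputs, an observed correlation of .20 rises to about .29 after correction, while a multiple R of .40 obtained from eight predictors in a sample of 60 shrinks to about .17. The second result shows how little of a reported multiple correlation may survive in a new sample when the original sample is small.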

Low  Base  Rate 

Predictor  variables.  Some  tests  are  designed  to  identify  applicants,  aviator  trainees,  or 
experienced  aviators  who  have  a  severe  psychopathological  problem  or  a  neurological  deficit. 
Tools  such  as  CogScreen,  dichotic  listening  tests,  and  the  MMPI  have  been  used  for  this  purpose. 
King  and  Flynn  (1995)  describe  CogScreen  as  “a  self-administered  screening  tool,  in  which  the 
subject  uses  a  light  pen  on  a  cathode  ray  tube  monitor.  CogScreen  may  be  superior  to  traditional 
neuropsychological  testing  in  determining  cognitive  deficits  after  a  central  nervous  system  injury 
or  dementing  disease  ....  The  CogScreen  is  very  sensitive  to  the  nuances  of  neuropsychological 
functioning  and  can  be  administered  in  a  group  setting”  (p.  954).  No  validity  studies  for 
CogScreen  could  be  located.  The  USAF  explored  the  possibility  of  using  CogScreen  for  aviator 
medical  screening,  but  it  was  never  used  operationally  for  aviator  selection.  The  US  Navy  is 
currently  including  CogScreen,  or  a  variation  of  it,  in  their  ongoing  studies  to  enhance  aviator 
selection. 

Severe  psychopathology  and  neurological  deficits  are  rare  in  the  general  population,  and 
are  even  rarer  in  the  highly-selected  population  of  aviators  (including  applicants  and  trainees). 
While it may be exceedingly important to identify individuals in the aviator population who
might or will experience these problems, doing so is like “looking for a needle in a
haystack.” The low base rate for these problems makes it extremely difficult to show a statistically
significant  relationship  between  test  scores  and  outcome  measures,  even  if  the  test  is  valid.  In 
past  research,  the  failure  to  find  significant  correlations  for  these  types  of  tests  was  sometimes 
inappropriately  generalized  to  all  tools  of  a  particular  type,  for  example,  all  personality 
inventories.  Callister,  King,  Retzlaff,  and  Marsh  (1999)  point  out,  “Testing  for  psychopathology 
has been shown to be of limited value in the assessment of the highly-functioning aviator
population.  On  the  other  hand,  measures  of  normal  personality  characteristics  have  been  shown 
to  be  useful  in  a  variety  of  settings  and  populations”  (p.  885). 
Criterion variables. As noted above, the most common criterion measure is pass-fail
status  at  the  end  of  aviator  training,  and  the  base  rate  for  failure  is  typically  low.  The  low  base 
rate  issue  becomes  even  more  extreme  when  researchers  attempt  to  categorize  failure  according 
to  type  or  reason,  for  example,  failure  due  to  lack  of  technical  competence  versus  failure  due  to 
attitudinal  problems,  or  when  the  training  failure  rate  is  mandated  by  policy  to  be  extremely  low. 
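
The ceiling that an uneven pass-fail split places on an observed correlation can be illustrated with a short calculation. The sketch below assumes that pass-fail status reflects a cut on a normally distributed underlying performance variable and that the predictor is continuous and normally distributed. Under those assumptions, the largest attainable point-biserial correlation is the normal density at the cut point divided by the square root of p(1 - p), and dividing an observed point-biserial correlation by that ceiling gives the usual biserial correction for dichotomization. The function names and pass rates are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def max_point_biserial(pass_rate: float) -> float:
    """Ceiling on the point-biserial r when a normal criterion is cut at pass_rate."""
    if not 0.0 < pass_rate < 1.0:
        raise ValueError("pass_rate must be strictly between 0 and 1")
    cut = _STD_NORMAL.inv_cdf(pass_rate)
    return _STD_NORMAL.pdf(cut) / sqrt(pass_rate * (1.0 - pass_rate))


def biserial_from_point_biserial(r_pb: float, pass_rate: float) -> float:
    """Correct an observed point-biserial r for dichotomization (biserial r)."""
    return r_pb / max_point_biserial(pass_rate)


if __name__ == "__main__":
    # Hypothetical pass rates chosen only for illustration.
    for rate in (0.50, 0.90, 0.95):
        print(rate, round(max_point_biserial(rate), 2))
```

Under these assumptions the ceiling is about .80 for a 50-50 split, about .58 when 90% of trainees pass, and about .47 when 95% pass. An observed validity of .20 against a criterion with a 95% pass rate would therefore correspond to a biserial correlation of roughly .42.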

Factor-Analytic  Work  in  the  Aviator  Selection  Research  Literature 

In  spite  of  the  aforementioned  limitations,  selection  test  developers  have  continued  to 
search  for  measures  that  might  predict  aviation  performance.  Accordingly,  researchers  have 
factor  analyzed  scores  on  several  aviator  selection  batteries  to  uncover  which  constructs  yield 
incremental  predictive  validity.  Most  of  the  work  was  conducted  by  USAF  researchers.  Several 
of  these  studies  are  summarized  below. 

Carretta  and  Ree  (1997a)  administered  the  Armed  Services  Vocational  Aptitude  Battery 
(ASVAB) and 17 psychomotor tests to enlisted USAF personnel (n = 429). They summarized
their  findings  as  follows: 

Confirmatory  factor  analysis  yielded  higher-order  factors  of  general  cognitive  ability  (g) 
and  psychomotor/technical  knowledge  (PM/TK).  PM/TK  was  interpreted  as  Vernon’s 
(1969)  practical  factor  (k:m).  In  the  joint  analysis  of  these  batteries,  g  and  PM/TK  each 
accounted  for  about  31%  of  the  common  variance.  No  residualized  lower-order  factor 
accounted  for  more  than  7%.  PM/TK  influenced  a  broad  range  of  lower-order 
psychomotor  factors.  The  first  practical  implication  of  these  findings  is  that  psychomotor 
tests  are  expected  to  be  at  least  generally  interchangeable.  A  second  implication  is  that 
the  incremental  validity  of  psychomotor  tests  beyond  cognitive  tests  is  expected  to  be 
small  (p.  165). 

Ree and Carretta (1992) conducted a similar study, using the ASVAB and three
psychomotor tests from the Basic Attributes Test (Carretta, 1987a). The sample was 354 USAF
enlisted recruits. They found that the two types of tests correlated with each other, with average
correlations in the .30s (corrected for range restriction, but not for test unreliability). They also
found that, as expected, there was a large first factor, which they labeled “psychometric g,” and
that both the ASVAB and the psychomotor tests loaded on it. Confirmatory factor analyses
revealed that both a seven-factor and a nine-factor model fit the data equally well. The more
parsimonious seven-factor model includes psychometric g and a higher-order general
psychomotor factor, which accounted for 57% and 9% of the total variance, respectively. The other
factors were 1) Verbal-Technical (an additional 8% of the variance), 2) Non-technical
General Knowledge (10%), 3) Time-Sharing (4%), 4) Two Hand Coordination (7%),
and 5) Complex Coordination (5%).

Carretta  and  Ree  (1998)  compared  the  factor  structure  of  the  ASVAB  with  the  factor 
structure  of  the  AFOQT.  The  factor  structure  for  each  test  battery  was  derived  in  a  different 
sample  of  USAF  personnel,  because  the  two  batteries  differ  in  difficulty  level  and  intended 
audience  (with  the  ASVAB  being  taken  by  all  Air  Force  applicants  and  the  AFOQT  being  taken 
by  Flight  Officer  applicants).  The  authors  conclude  “The  AFOQT  is  comprised  of  five  lower- 
order  factors:  verbal,  math,  spatial,  aircrew,  and  perceptual  speed  which  accounted  for  20%  of 
the  total  variance,  and  g  in  hierarchical  position  accounted  for  41%  of  the  total  variance. 
Compared with the ASVAB, the AFOQT was less saturated [with g] but had more common
factors  and  had  a  greater  proportion  of  its  variance  associated  with  common  factors”  (p.  12). 

Carretta,  Retzlaff,  and  King  (1997)  compared  the  AFOQT  and  the  Multidimensional 
Aptitude  Battery  (MAB).  The  MAB  is  a  broad-based  test  of  intellectual  ability  patterned  after 
the  Wechsler  Adult  Intelligence  Scale  but  designed  for  group  administration.  The  sample  in  this 
study  was  approximately  2,200  USAF  aviator  candidates.  A  joint  factor  analysis  of  the  AFOQT 
and  the  MAB  revealed  that  each  battery  had  a  hierarchical  structure.  The  correlation  between  the 
higher-order  factors  from  the  two  batteries  was  .981,  indicating  that  both  measured  the  same 
thing,  which  these  authors  conclude  is  general  intelligence  (g). 

Ambler  and  Smith  (1974)  analyzed  data  for  the  seven  tests  of  the  Guilford-Zimmerman 
Aptitude  Survey,  the  Hidden  Figures  Test,  and  four  subtests  from  the  US  Navy-Marine  Corps 
aviation  selection  battery  [1)  Aviation  Qualification,  which  includes  reading,  math,  and  science 
questions  related  to  a  typical  college  experience;  2)  Mechanical  Comprehension;  3)  Spatial 
Apperception;  and  4)  Biographical  Inventory].  Scores  were  available  for  approximately  1,700 
aviation  trainees  (presumably  all  male,  given  that  the  study  was  published  in  1974).  The 
researchers  factor  analyzed  the  subtest  scores  in  the  total  sample  and  in  various  subsamples  and 
found  that  six  factors  appeared  consistently  across  samples,  which  they  labeled  Mechanical, 
Spatial  Manipulation,  Perceptual  Flexibility,  Verbal  Intelligence,  Numerical  Intelligence,  and 
Flight  Motivation. 

Martinussen  and  Torjussen  (1998)  factor  analyzed  scores  on  a  multi-aptitude  test  battery 
used for aviator selection into the Norwegian Air Force. The battery is administered in a multi-stage
process. Stage 1 includes 12 subtests intended to measure General Intelligence, Technical
Comprehension,  and  Spatial  Ability.  Stage  2  includes  seven  subtests  intended  to  measure 
Simultaneous  Capacity  and  Orientation  Ability.  Finally,  Stage  3  includes  a  personality  inventory 
called  the  Defense  Mechanism  Test  (DMT)  which  is  described  as  a  measure  of  psychodynamic 
defense  mechanisms  and  was  developed  for  use  in  selecting  persons  into  high-risk  professions. 
Very  little  information  is  provided  about  any  of  the  subtests. 

The  authors  randomly  selected  450  applicants  from  the  applicant  pool  who  had  Stage  1 
and  Stage  2  scores,  and  factor-analyzed  the  scores  using  Principal  Component  Analysis  with 
Varimax  rotation.  The  tests  included  in  each  stage  were  factor-analyzed  separately.  Three 
factors,  labeled  Mechanical  Comprehension  and  Spatial  Ability,  Verbal  Ability,  and  Numerical 
Reasoning  accounted  for  61%  of  the  variance  in  the  Stage  1  tests  and  three  factors,  labeled 
Spatial  Ability,  Time  Estimation,  and  Perceptual  Speed  and  Coordination,  accounted  for  62%  of 
the  variance  in  the  Stage  2  tests. 

In  summary,  factor  analyses  of  several  aviator  selection  batteries  suggest  that  it  is 
possible  to  derive  a  hierarchical  general  intelligence  factor,  with  sub-factors  related  to  verbal 
ability, numerical ability, mechanical ability, spatial ability, and perceptual speed/flexibility. A
general psychomotor factor, with some specific sub-factors, also appears when the test battery
explicitly contains psychomotor tests.

The  factor-analytic  work  in  the  aviator  selection  domain  is  consistent  with  research 
conducted  on  the  structure  of  human  abilities  that  is  not  entirely  based  on  military  or  aviator  test 
data (e.g., see Fleishman & Mumford, 1988; Lubinski & Dawis, 1992; McHenry & Rose, 1988;
Russell,  Reynolds,  &  Campbell,  1994).  It  is  worth  noting  that,  with  the  exception  of  one  subtest 
in  the  US  Navy-Marine  Corps  selection  battery  (Ambler  &  Smith,  1974),  the  aviator  selection 
batteries  included  in  the  factor-analytic  studies  described  above  did  not  include  measures  of  non- 
cognitive  traits.  It  is  not  particularly  surprising,  then,  that  no  underlying  non-cognitive  factors 
were  found. 

Models  of  Skill  Acquisition 

This  section  examines  more  closely  the  hierarchical  general  intelligence  factor  derived  by 
the  factor  analyses  described  above.  Specifically,  the  question,  “How  does  intelligence  relate  to 
skill  acquisition  during  flight  training?”  is  addressed. 

Ackerman  (1987;  1988;  1990)  developed  a  model  of  skill  acquisition  that  is  applicable  to 
the  development  of  piloting  skills.  The  theory  is  founded  on  the  concept  of  attentional  resource 
allocation,  that  is,  the  amount  of  attentional  resources  required  by  various  tasks  at  various  points 
in  time,  and  the  amount  of  attentional  resources  that  individuals  can  bring  to  bear  in  any  given 
situation.  Ackerman’s  model  divides  skill  acquisition  into  three  broad  phases,  with  a 
corresponding  type  of  ability  that  is  the  primary  predictor  of  performance  within  each  phase.  In 
Phase  I,  the  primary  learning  task  is  to  comprehend  the  new  task.  Declarative  knowledge  and 
general  intelligence  are  the  primary  predictors  of  performance  in  this  phase.  In  Phase  II,  the 
primary  learning  task  involves  integrating  the  cognitive  and  motor  processes  required  to  perform 
the  task.  In  this  phase,  knowledge  compilation  and  perceptual  speed  are  the  primary  predictors  of 
performance.  In  Phase  III,  task  performance  becomes  proceduralized  (or  automatic),  and  thus 
requires  fewer  attentional  resources.  Procedural  knowledge  and  psychomotor  abilities  are  the 
most  important  predictors  in  this  phase.  Tasks  vary  in  the  extent  to  which  they  can  be 
proceduralized.  In  Ackerman’s  terminology,  tasks  that  can  become  proceduralized  are  called 
consistent  tasks;  those  that  cannot  become  proceduralized  are  called  inconsistent  tasks. 

According  to  Ackerman’s  theory,  general  cognitive  ability  is  expected  to  be  most 
important  during  the  early  stages  of  skill  acquisition  for  all  tasks  and  to  remain  important  for 
inconsistent  tasks.  Processing  speed  is  expected  to  be  most  important  during  intermediate  stages 
of learning for any task. Psychomotor skills will become increasingly important as a task
becomes better-learned, but may only outstrip cognitive ability in importance for consistent
tasks. Keil and Cortina (2001) found confirmatory evidence for the relationship between
cognitive ability and performance on consistent and inconsistent tasks but did not find support
for the predicted relationships of perceptual speed and psychomotor skills with performance on
consistent and inconsistent tasks. Additional research by Ackerman and colleagues shows that both ability and
non-ability  factors  (e.g.,  personality,  vocational  interests,  motivation,  and  self-concept)  play  a 
role  in  determining  performance  on  complex  (inconsistent)  tasks  (Ackerman,  Kanfer,  &  Goff, 
1995;  Ackerman  &  Woltz,  1994). 

Ree,  Carretta,  and  Teachout  (1995)  developed  a  causal  model  to  explore  the  role  played 
by  general  intelligence  (g)  and  prior  knowledge  of  flying  on  performance  during  aviator  training. 
The  measures  of  g  and  prior  flying  knowledge  were  based  on  AFOQT  composite  and  subtest 
scores  collected  at  the  time  of  application  to  flight  training.  Criterion  measures  included 
measures  of  job  knowledge  (academic  classroom  performance)  and  work  samples  (check  ride 
performance)  collected  at  various  points  during  a  53-week  training  program.  When  the  model 
was tested in a large sample of USAF aviator trainees (n = 3,428 males), the authors found that g
directly  influenced  the  acquisition  of  flight  knowledge  both  prior  to  and  during  training  and 
indirectly  influenced  work  sample  performance  through  the  acquisition  of  job  knowledge.  Prior 
knowledge  of  flying  had  almost  no  influence  on  acquisition  of  job  knowledge  during  the 
academic  portions  of  aviator  training,  but  directly  influenced  performance  on  early  work  sample 
measures.  Early  work  sample  performance  was  very  strongly  related  to  later  work  sample 
performance.  Carretta  and  Ree  (1997b)  tested  the  same  model  in  a  sample  of  male  USAF 
aviators (n = 3,369) and in a small sample of female USAF aviators (n = 59). The basic model
was  supported  and  appeared  to  work  similarly  for  males  and  females,  although  the  female 
sample  was  too  small  to  draw  any  strong  conclusions. 


Evidence  of  Predictive  Validity  for  Flight  Training  Performance 

An  enormous  number  of  validation  studies  have  been  conducted  in  the  aviator  selection 
domain  -  too  many  to  cover  in  this  report.  Fortunately,  several  meta-analyses  focusing  on  the 
validity  of  selection  tests  have  been  published.  In  all  the  studies,  measurement  method  is 
confounded  with  measurement  target  (KSAOs)  to  at  least  some  degree.  This  section  describes 
meta-analyses  addressing  validity  evidence. 

Damos  (1993)  Meta-Analysis 

The  first  meta-analysis  of  aviation  performance  predictors  was  published  by  Damos  in 
1993.  She  meta-analyzed  12  studies  that  involved  a  single-task  performance-based  measure,  for 
example,  tracking  or  dichotic  listening,  and  14  studies  that  involved  multiple-task  performance- 
based  measures,  that  is,  two  or  more  single-task  measures  administered  simultaneously,  such  as 
tracking  plus  dichotic  listening.  The  mean  correlation  (uncorrected)  between  single-task 
performance  and  flight  grades  was  .18  (n  =  5,378);  the  correlation  between  multiple-task 
performance and the same criterion was .23 (n = 6,920). Moderator analyses suggested that the
level  of  validity  for  multiple-task  performance-based  measures  depended  on  the  type  of  sample 
(military  versus  civilian)  and  level  of  flight  experience  (students  versus  fully-trained  aviators), 
with  higher  validity  in  studies  with  a  civilian  sample  or  with  a  fully-trained  aviator  sample. 

Hunter  and  Burke  (1994)  Meta-Analysis 

The second meta-analysis was published by Hunter and Burke in 1994.¹ They reviewed
200  studies  published  between  1940  and  1990  that  involved  aircrew  selection.  Sixty-nine  studies 
contained  one  or  more  usable  validity  coefficients,  and  the  authors  located  or  derived  468 
validity  coefficients  from  these  studies.  It  is  worth  noting  that  studies  reporting  only  a  composite 
score  based  on  a  multi-aptitude  test  battery  were  excluded  from  the  meta-analysis.  The  majority 
of  the  validity  coefficients  were  based  on  studies  conducted  in  the  US  (77%),  involving  a 
military  sample  (94%),  and/or  a  sample  that  was  training  to  fly  fixed-wing  aircraft  (86%).  Most 
of the studies used dichotomous pass-fail criterion measures (84%) which, as noted above, places
a ceiling on the maximum possible correlation, with a lower ceiling to the extent the criterion
distribution departs from a 50-50 split (as is likely the case in virtually all of the studies).


¹ An earlier version of this meta-analysis was also published in Hunter and Burke (1992). The general
findings are the same in the two versions, but the specific values cited for various predictor types are not
exactly the same.

Hunter  and  Burke  (1994)  categorized  each  validity  coefficient  according  to  one  of  16 
predictor  types.  They  then  applied  bare-bones  meta-analytic  procedures  (Hunter  &  Schmidt, 
1990).  Table  1  is  adapted  from  a  table  of  results  published  in  Hunter  and  Burke  (1994).  It 
shows,  for  each  predictor  type,  the  mean  sample-weighted  validity  (uncorrected),  the  percentage 
of variance explained by sampling error, and the lower bound for the 95% confidence interval.²
The  predictor  type  with  the  highest  mean  sample-weighted  validity  is  “Job  Sample.”  The 
authors  do  not  describe  this  predictor  type,  but  one  might  speculate  that  it  includes  flight 
simulation  tests.  Mechanical  ability,  gross  dexterity,  reaction  time,  biodata  inventory,  and 
information  (General  or  Aviation)  predictors  also  showed  relatively  high  validity  and  a 
confidence  interval  that  did  not  include  zero.  Recall  that  “biodata”  is  a  measurement  method. 
There  is  no  way  of  determining,  from  this  meta-analytic  review,  what  KSAOs  were  measured. 

After conducting the bare-bones meta-analysis, Hunter and Burke applied two validity
generalization decision rules: 1) Does sampling error account for more than 75% of the variance
in observed validities? and 2) Does the 90% credibility limit include zero? Answering “yes” to
the first decision rule allows one to conclude that validity is generalizable across samples and
settings. Answering “no” to the second decision rule allows one to conclude that the true
validity in the population is greater than zero. None of the predictor types included in this
meta-analysis met the first decision rule, but several met the second. For the following predictor types, it is
reasonable to believe that the true validity is greater than zero in any setting or sample, but the
level of validity may vary from one setting or sample to another: Quantitative Ability; Spatial
Ability; Mechanical; Aviation Information; General Information; Gross Dexterity; Perceptual
Speed; Reaction Time; Biodata Inventory; and Job Sample.
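
The bare-bones procedure and the two decision rules can be sketched as follows. This is a minimal illustration based on the standard Hunter and Schmidt (1990) bare-bones formulas: a sample-size-weighted mean and variance of the observed correlations, with expected sampling error variance computed from the mean correlation and the average sample size. It is not a reconstruction of Hunter and Burke's actual computations, and the study data and all names in the code are invented.

```python
from math import sqrt
from statistics import NormalDist
from typing import NamedTuple, Sequence, Tuple


class BareBonesResult(NamedTuple):
    mean_r: float              # sample-size-weighted mean correlation
    pct_sampling_error: float  # sampling error variance / observed variance (can exceed 1)
    credibility_lower: float   # lower 90% credibility value
    passes_75_pct_rule: bool   # sampling error explains more than 75% of variance
    excludes_zero: bool        # lower credibility value is above zero


def bare_bones(studies: Sequence[Tuple[int, float]]) -> BareBonesResult:
    """Bare-bones meta-analysis of (sample size, correlation) pairs."""
    if not studies:
        raise ValueError("at least one study is required")
    total_n = sum(n for n, _ in studies)
    mean_r = sum(n * r for n, r in studies) / total_n
    var_observed = sum(n * (r - mean_r) ** 2 for n, r in studies) / total_n
    mean_n = total_n / len(studies)
    var_error = (1.0 - mean_r ** 2) ** 2 / (mean_n - 1.0)
    # If the correlations do not vary, treat sampling error as explaining everything.
    pct = var_error / var_observed if var_observed > 0 else 1.0
    sd_residual = sqrt(max(var_observed - var_error, 0.0))
    z_90 = NormalDist().inv_cdf(0.90)  # about 1.28
    lower = mean_r - z_90 * sd_residual
    return BareBonesResult(mean_r, pct, lower, pct > 0.75, lower > 0.0)


if __name__ == "__main__":
    # Invented (n, r) pairs for illustration only.
    print(bare_bones([(1000, 0.10), (2000, 0.20), (1000, 0.30)]))
```

In this invented example the mean validity is .20, sampling error accounts for roughly 14% of the observed variance, and the lower credibility value is about .12. The 75% rule is therefore not met, but the credibility value excludes zero, which is the same pattern described above for predictor types such as Spatial Ability and Mechanical.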


² Hunter and Burke (1994) claim that, in keeping with decision rules established by Hunter & Schmidt
(1990),  they  calculated  and  used  the  90%  credibility  limit,  rather  than  the  95%  confidence  interval.  In  their 
table  of  results,  however,  they  report  the  95%  confidence  interval. 
Table 1

Hunter and Burke (1994) Meta-analytic Results for Various Predictor Types

                        # of           Total                   % Variance Explained  95% CI
Predictor Type          Correlations   Sample Size   Mean r    by Sampling Error     Lower Bound

General Ability         14             8,071         .13       21%                   -.05
Verbal Ability          17             22,841        .12       6%                    -.09
Quantitative Ability    34             46,884        .11       28%                   [illegible]
Spatial Ability         37             52,153        .19       14%                   .05
Mechanical              36             42,418        .29       8%                    .11
General Information     13             29,951        .25       4%                    .06
Aviation Information    23             25,295        .22       12%                   .06
Gross Dexterity         60             48,988        .32       13%                   .15
Fine Dexterity          12             2,792         .10       45%                   -.09
Perceptual Speed        41             33,511        [missing] 19%                   .05
Reaction Time           7              10,633        .28       16%                   .16
Biodata Inventory       21             [missing]     .27       6%                    .07
Age                     9              [illegible]   -.10      11%                   -.25
Education               9              6,163         .06       12%                   -.16
Job Sample              16             2,814         .34       37%                   .19
Personality             46             22,486        .10       11%                   -.16

Notes. 

1. Mean r is weighted by sample size, but has not been corrected for any other artifacts.

2.  When  analyzing  the  data,  validity  coefficients  were  reflected  for  predictor  types  that  would  be  expected  to 
show  a  negative  correlation  with  the  criterion  variable,  that  is,  those  involving  measures  of  speed.  Thus,  in  the 
table  above,  positive  correlations  indicate  that  better  performance  on  the  predictor  is  associated  with  better 
performance  on  the  criterion  measures. 
According to Hunter and Burke (1994), the following predictor types may show non-zero
validity  in  some  settings  or  samples: 

•  General  Ability 

•  Verbal  Ability 

•  Fine  Dexterity 

•  Age 

•  Education 

•  Personality  -  Recall  that  “personality”  is  a  measurement  method.  Across  studies, 
some  of  the  personality  scales  likely  were  expected  to  show  a  negative  correlation  with 
criterion  performance  (e.g.,  Anxiety),  while  other  scales  were  likely  expected  to  show  a 
positive  correlation  (e.g.,  Self-Confidence).  For  still  other  personality  scales,  there  likely 
was  no  clear  a  priori  expectation  about  the  direction  of  the  correlation  (e.g.,  Risk- 
Taking).  One  might  argue  that  averaging  across  all  the  different  types  of  scales  does  not 
provide  an  accurate  representation  of  the  true  level  of  validity  that  might  be  achieved  by 
measures  of  specific  personality  traits. 

Hunter  and  Burke  conducted  moderator  analyses  for  a  subset  of  the  predictor  types  for 
which there were sufficient data. They examined four possible moderators: 1) time period in
which the study was conducted (1940-1960 versus 1961-1990), 2) nationality of the study
sample (US versus other), 3) service branch (Air Force versus other), and 4) aircraft type
(fixed-wing versus rotary-wing). The most consistent finding was that the time period in which the
study  was  conducted  moderated  the  validity  of  several  predictor  types,  with  lower  mean  validity 
in  more  recent  studies.  The  authors  speculate  that  the  decline  in  validity  over  time  could  be  due 
to  reduced  variability  in  the  applicant  pool,  more  extreme  splits  on  dichotomous  criterion 
measures  (e.g.,  farther  away  from  a  50-50  split  in  the  proportion  of  trainees  who  pass  versus  fail 
UPT),  or  changes  in  the  nature  of  aviator  training.  The  other  moderator  variables,  at  least  as 
coded  in  this  meta-analysis,  provided  very  little  explanatory  power. 

Martinussen  (1996)  Meta-Analysis 

The  third  meta-analysis  was  published  by  Martinussen  in  1996.  She  conducted  a 
standard  computerized  literature  search  and  also  made  a  special  effort  to  collect  unpublished 
validation  studies  focusing  on  military  aircrew  selection  from  researchers  in  NATO  countries. 
Studies that did not report the magnitude of nonsignificant correlations or only reported
corrected correlations were excluded. (Hunter and Burke do not say how they handled such
studies.) Martinussen reports that she reviewed 134 studies, and located 66 independent samples
in 50 studies that met her criteria for inclusion. Fifty percent of the studies were conducted in
the United States. Most samples involved military aviators, with the bulk of those belonging to
the Air Force. Two-thirds of the studies involved fixed-wing aviators and 21% involved
rotary-wing aviators (12% did not specify the type of aircraft). Twenty (40%) of the studies were
unpublished  material.  All  of  the  studies  used  performance  during  aviator  training  as  the  criterion 
variable  -  dichotomous  pass/fail  status,  instructor  ratings,  or  course  grades.  While  the 
distribution  of  study  types  is  similar  to  that  described  by  Hunter  and  Burke  (1994),  comparison 
of  the  reference  lists  reveals  very  little  overlap  in  the  studies  included  in  each  review.  In  fact,
fewer  than  20  studies  appeared  in  both  meta-analyses. 

Martinussen  categorized  each  predictor  measure  into  one  of  nine  measurement  methods 
(predictor  types).  Each  is  described  below: 

1.  Cognitive  includes  all  tests  designed  to  measure  a  specific  type  of  cognitive  ability
(e.g.,  mechanical,  spatial,  verbal,  quantitative). 

2.  Intelligence  includes  tests  specifically  designed  to  measure  global  intelligence. 

3.  Psychomotor/Information  Processing  includes  all  tests  involving  apparatus  or  a 
computer.  Obviously,  this  could  encompass  several  different  types  of  ability 
measures  (e.g.,  psychomotor  skills,  reaction  time).

4.  Aviation  information  includes  tests  with  questions  about  aviation.  Martinussen  points 
out  that  most  psychologists  interpret  such  tests  as  measures  of  motivation  to  become 
an  aviator. 

5.  Biographical  inventories  collect  background  information  about  applicants,  and  then 
summarize  the  information  according  to  a  total  score.  Although  Martinussen  does  not 
comment  on  the  nature  of  the  inventories,  it  is  likely  that  many  of  them  were 
empirically  scored.

6.  Personality  tests  include  a  variety  of  personality  inventories.  The  data  are  not 
organized  according  to  personality  trait  but,  unlike  Hunter  and  Burke  (1994), 
Martinussen  did  attempt  to  take  the  expected  relationship  between  the  underlying 
scale  and  the  criterion  variable  into  account  by  reflecting  the  sign  of  the  correlation,  if 
needed,  based  on  information  in  the  original  study.  In  cases  where  no  expectation 
about  the  direction  of  the  relationship  could  be  derived  from  the  original  study, 
Martinussen  coded  the  absolute  value  of  the  correlation,  in  effect  making  it  positive. 
This  has  the  overall  effect  of  inflating  the  mean  validity  coefficient. 

7.  Combined  index  was  used  when  a  validity  coefficient  was  reported  only  for  a 
combination  of  predictor  measures.  Martinussen  does  not  report  how,  or  if,  she  took 
account  of  the  fact  that  such  measures  may  capitalize  on  chance,  for  example,  if  they 
were  created  using  a  regression  procedure.  (Hunter  and  Burke  excluded  these 
studies.) 

8.  Academics  includes  school  grades  or  tests  that  measured  mathematical  or  language 
proficiency. 

9.  Training  experience  includes  measures  of  flying  performance  prior  to  selection  into 
the  training  program  that  was  the  focus  of  the  study.  It  is  not  clear  if  these  included 
self-reported  or  verified,  objective  measures  of  prior  flight  hours/performance,  or 
both. 
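
The  inflation  noted  under  the  Personality  category  (item  6)  can  be  made  concrete  with  a  small  simulation.  The  sketch  below  assumes  a  true  validity  of  zero  and  approximates  the  sampling  distribution  of  r  as  normal  with  standard  deviation  1/sqrt(n  -  1);  the  number  of  studies,  the  sample  size,  and  the  function  name  are  hypothetical.

```python
import random
from math import sqrt


def simulate_mean_validity(n_studies: int = 5000, sample_size: int = 100, seed: int = 0):
    """Draw observed correlations for a predictor whose true validity is zero and
    return (mean of signed r, mean of |r|)."""
    rng = random.Random(seed)
    # Approximate sampling SD of r when the true correlation is zero.
    sampling_sd = 1.0 / sqrt(sample_size - 1)
    observed = [rng.gauss(0.0, sampling_sd) for _ in range(n_studies)]
    mean_signed = sum(observed) / n_studies
    mean_absolute = sum(abs(r) for r in observed) / n_studies
    return mean_signed, mean_absolute


if __name__ == "__main__":
    signed, absolute = simulate_mean_validity()
    print(f"mean signed r   = {signed:.3f}")
    print(f"mean absolute r = {absolute:.3f}")
```

With  samples  of  100,  the  signed  correlations  average  close  to  zero  while  their  absolute  values  average  roughly  .08,  even  though  the  predictor  has  no  validity  at  all  in  this  simulation.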

Table  2  shows  the  number  of  correlations,  total  sample  size,  mean  sample-weighted 
correlation  (observed  and  corrected  for  dichotomization),  percent  variance  explained  by 
sampling  error,  and  90%  credibility  limit  for  each  measurement  method.  Using  decision  rules 
similar  to  those  applied  by  Hunter  and  Burke  (1994),  Martinussen  suggests  that  the  mean  validity 
of  the  Academics  measurement  method  (r  =  .15)  is  likely  to  generalize  across  samples  and
settings,  given  that  70%  of  the  variance  in  observed  validity  is  explained  by  sampling  error.  For
the  remaining  measurement  methods,  it  appears  that  there  may  be  moderator  variables  that 
impact  the  level  of  validity  across  settings  and  samples,  but  the  90%  credibility  limit  is  greater 
than  zero  for  all  but  two  of  them  (biographical  inventory  and  personality). 
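
The  statistics  reported  in  Table  2  can  be  sketched  as  a  "bare-bones"  meta-analysis  in  the  Hunter  and  Schmidt  tradition.  The  Python  example  below  is  an  illustration  under  stated  assumptions,  not  a  reconstruction  of  Martinussen's  exact  procedure:  the  three  (r,  n)  pairs  are  hypothetical,  and  the  sampling  error  variance  uses  the  common  approximation  based  on  the  average  sample  size.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical (observed r, sample size) pairs; not taken from any reviewed study.
HYPOTHETICAL_STUDIES = [(0.10, 200), (0.20, 400), (0.30, 400)]


def bare_bones_meta(studies):
    """Return (sample-weighted mean r, % variance explained by sampling error,
    lower 90% credibility limit) for a list of (r, n) pairs."""
    total_n = sum(n for _, n in studies)
    mean_r = sum(r * n for r, n in studies) / total_n
    # Sample-size-weighted variance of the observed correlations.
    var_observed = sum(n * (r - mean_r) ** 2 for r, n in studies) / total_n
    # Variance expected from sampling error alone, using the average sample size.
    mean_n = total_n / len(studies)
    var_sampling = (1 - mean_r ** 2) ** 2 / (mean_n - 1)
    pct_explained = 100.0 if var_observed == 0 else 100 * var_sampling / var_observed
    # Residual SD after removing sampling error (floored at zero).
    sd_residual = sqrt(max(0.0, var_observed - var_sampling))
    z90 = NormalDist().inv_cdf(0.90)  # about 1.28
    lower_90 = mean_r - z90 * sd_residual
    return mean_r, pct_explained, lower_90


if __name__ == "__main__":
    mean_r, pct, lower = bare_bones_meta(HYPOTHETICAL_STUDIES)
    print(f"mean r = {mean_r:.2f}, % variance explained = {pct:.0f}%, "
          f"90% credibility limit = {lower:.2f}")
```

A  high  percentage  of  variance  explained  by  sampling  error  leaves  little  residual  variability  for  moderators  to  account  for,  and  a  lower  credibility  limit  above  zero  indicates  that  the  validity  is  expected  to  be  positive  in  most  settings.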

Martinussen  (1996)  also  found  a  negative  correlation  between  year  of  study  publication 
and  validity  coefficients  for  each  type  of  predictor  measure  except  Training  Experience.  This
is  consistent  with  the  finding  of  a  decline  in  validity  across  time  reported  by  Hunter  and  Burke 
(1994).  She  also  conducted  several  moderator  analyses.  Of  most  interest  for  the  present  effort 
was  her  finding  of  a  significant  difference  in  the  mean  validity  of  two  measurement  methods  -
(general)  intelligence  and  training  experience  -  depending  on  type  of  aircraft.  General
intelligence  tests  showed  higher  validity  in  samples  of  rotary-wing  aviators  (mean  uncorrected
r  =  .27)  than  in  samples  of  fixed-wing  aviators  (mean  uncorrected  r  =  .11).

Table  2

Martinussen  (1996)  Meta-analytic  Results  for  Various  Measurement  Methods

                                                                      % Variance
                                               Total                  Explained by   90%
                                # of           Sample                 Sampling       Credibility
Measurement Method              Correlations   Size     Mean r        Error          Limit
------------------------------  ------------   ------   -----------   ------------   -----------
Cognitive                       35             17,900   .22 (.24)     12%            .07
Intelligence                    26             15,403   .13 (.16)     18%            .03
Psychomotor/Info Processing     29              8,522   [illegible]   28%            .10
Aviation Information            16              3,736   [illegible]   46%            .14
Personality                     21              6,304   .13 (.14)     24%            .00
Biographical Inventory          13             11,347   .21 (.23)      4%            .00
Combined Index                  14              5,362   .31 (.37)     13%            .19
Academics                        9              4,267   .15           70%            .11
Training Experience             10              5,806   .25            7%            .07

Note.  Mean  r  is  weighted  by  sample  size.  The  value  enclosed  in  parentheses  is  the  sample-weighted  mean  r 
corrected  for  criterion  dichotomization. 
In  contrast,  training  experience  showed  higher  validity  in  samples  of  fixed-wing  aviators
(mean  uncorrected  r  =  .35)  than  in  samples  of  rotary-wing  aviators  (mean  uncorrected  r  =  .12).
The  latter  finding  may  be  due  to  the  fact  that  individuals  who  pursue  a  private  pilot’s  license 
prior  to  entering  a  formal  aviator  training  program  are  more  likely  to  do  so  in  a  fixed-wing 
aircraft.  As  a  consequence,  the  training  experience  may  more  directly  transfer  to,  and  thus 
positively  affect,  performance  in  fixed-wing  aviator  training  than  in  rotary-wing  aviator  training. 
This  finding  is  also  consistent  with  anecdotal  evidence  that  “too  much”  prior  experience  or 
training  in  fixed-wing  aircraft  can  be  detrimental  when  learning  to  fly  a  rotary-wing  aircraft. 

Martinussen  and  Torjussen  (1998)  Meta-Analysis 

The  fourth  meta-analysis,  conducted  by  Martinussen  and  Torjussen  (1998),  focused
exclusively  on  a  test  battery  used  for  aviator  selection  into  the  Norwegian  Air  Force  (NAF). 

Four  studies  were  included,  with  two  to  five  independent  samples  for  each  of  19  subtests
included  in  the  test  battery.  Sample  sizes  ranged  from  244  to  977  per  subtest.  In  all  four  studies, 
the  test  battery  was  used  in  the  aviator  selection  process  so  there  was  direct  restriction  of  range 
on  the  subtest  scores.  Furthermore,  spatial  abilities  were  measured  in  each  of  two  successive 
stages  of  the  battery,  albeit  with  different  tests.  As  a  consequence,  the  final  sample  was  highly 
restricted  in  terms  of  spatial  ability.  Criterion  measures  were  based  on  training  performance, 
primarily  pass/fail  status,  but  also  instructor  ratings  and  course  grades. 

Out  of  19  subtests,  10  showed  a  90%  credibility  limit  greater  than  zero.  The  mean 
uncorrected  validity  was  lower  than  .20  for  all  but  two  of  them  -  Aviation  Information  (mean 
uncorrected  r  =  .21)  and  Instrument  Comprehension  (mean  uncorrected  r  =  .26),  both  of  which 
were  administered  in  the  first  stage  of  testing.  Martinussen  corrected  the  validities  for 
dichotomization  of  the  criterion  measure  (when  appropriate),  but  did  not  correct  them  for  range 
restriction.  The  corrected  validities  are  consistently  somewhat  higher.  One  can  only  speculate 
how  high  they  might  be  if  corrected  for  range  restriction  as  well. 
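
To  give  a  sense  of  what  a  range  restriction  correction  can  do,  the  sketch  below  applies  the  standard  formula  for  direct  restriction  on  the  predictor  (often  called  Thorndike's  Case  II).  This  is  not  an  analysis  performed  by  Martinussen  and  Torjussen:  the  observed  value  of  .26  is  the  Instrument  Comprehension  validity  cited  above,  but  the  ratio  of  applicant  to  trainee  standard  deviations  (1.5)  is  purely  hypothetical,  as  is  the  function  name.

```python
from math import sqrt


def correct_for_direct_range_restriction(restricted_r: float, sd_ratio: float) -> float:
    """Estimate the unrestricted correlation from a correlation observed in a
    sample directly selected on the predictor.

    sd_ratio is the predictor SD among applicants divided by the predictor SD
    among selected trainees (1.0 means no restriction).
    """
    numerator = restricted_r * sd_ratio
    denominator = sqrt(1.0 + restricted_r ** 2 * (sd_ratio ** 2 - 1.0))
    return numerator / denominator


if __name__ == "__main__":
    OBSERVED_R = 0.26        # Instrument Comprehension, as cited in the text
    ASSUMED_SD_RATIO = 1.5   # hypothetical; the true ratio is not reported here
    estimate = correct_for_direct_range_restriction(OBSERVED_R, ASSUMED_SD_RATIO)
    print(f"estimated unrestricted r = {estimate:.2f}")
```

With  these  illustrative  inputs  the  estimate  rises  to  roughly  .37.  The  size  of  the  correction  depends  entirely  on  the  assumed  standard  deviation  ratio,  which  would  have  to  be  estimated  from  applicant  data.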

Interestingly,  this  is  the  only  study  in  which  the  90%  credibility  limit  was  greater  than 
zero  for  a  personality  measure,  although  the  mean  validity  was  still  low  and  consistent  with  the 
level  reported  in  other  meta-analytic  reviews  (mean  r  =  .06  and  .12  for  two  non-independent 
scoring  methods  used  within  the  same  inventory).  According  to  Martinussen  and  Torjussen,  the 
personality  inventory  -  the  DMT  -  measures  psychodynamic  defense  mechanisms,  and  was 
specifically  developed  to  select  personnel  for  high-risk  professions.  The  Norwegian  Air  Force 
used  it  as  a  post-selection  screening  device  for  individuals  who  had  already  been  selected  into 
aviator  training. 

Summary  of  Meta-Analytic  Validation  Studies

As  noted  above,  there  was  very  little  overlap  in  the  studies  included  in  three  of  the  meta- 
analytic  reviews.  (The  obvious  exception  is  that  all  four  of  the  studies  included  in  the 
Martinussen  and  Torjussen  (1998)  review  also  appeared  in  Martinussen’s  (1996)  broader 
review.)  Only  four  of  the  studies  reviewed  by  Damos  (1996)  appear  in  the  Hunter  and  Burke 
(1994)  citation  list,  and  only  one  appears  in  the  Martinussen  (1996)  citation  list.  Fewer  than  20 
references  appear  in  both  the  Hunter  and  Burke  (1994)  and  Martinussen  (1996)  reviews. 
Different  authors  also  categorized  the  predictor  measures  differently,  making  it  difficult  to
compare  the  results  from  different  reviews.  Nevertheless,  the  following  summary  statements  can
be  made: 

•  Global  intelligence  tests  showed  about  the  same,  relatively  low  level  of  validity  in  the 
two  meta-analyses  in  which  they  were  included  (mean  uncorrected  r  =  .13),  with 
support  for  validity  generalizability  in  one  study  but  not  in  the  other. 

•  The  validity  of  specific  cognitive  ability  tests  seems  to  vary  depending  on  the  type  of 
ability  being  measured  but  tends  to  be  higher  than  that  of  more  global  measures  of 
intelligence.  This  statement  is  supported  by  the  mean  uncorrected  validity  of  .22  for 
the  cognitive  measurement  method,  as  opposed  to  the  mean  uncorrected  validity  of  .13 
for  the  global  intelligence  measurement  method,  as  reported  in  Martinussen  (1996).  It 
is  also  supported,  to  some  degree,  by  Hunter  and  Burke’s  finding  that  the  mean 
uncorrected  validities  for  two  specific  cognitive  ability  predictor  types,  Spatial  (r  =
.19)  and  Mechanical  (r  =  .29),  are  higher  than  that  for  the  global  intelligence  predictor
type  (r  =  .13).  However,  the  mean  uncorrected  validity  of  verbal  and  quantitative
ability  predictor  types  reported  by  Hunter  and  Burke  (r  =  .12  and  .11,  respectively)  is 
about  the  same  as  that  reported  for  the  global  intelligence  predictor  type.  This  finding 
may  be  at  least  partially  due  to  higher  content  overlap  between  global  intelligence  and 
verbal  and  quantitative  ability  tests  than  between  global  intelligence  and  spatial  or 
mechanical  ability  tests. 

•  There  is  some  evidence  that  Mechanical  ability  tests  are  among  the  more  valid
predictors  of  performance  during  aviator  training,  as  evidenced  by  the  mean  uncorrected
validity  of  .29  in  the  Hunter  and  Burke  (1994)  meta-analysis  and  the  mean  uncorrected 
validity  of  .26  for  the  Instrument  Comprehension  subtest  in  the  Martinussen  and 
Torjussen  (1998)  meta-analysis.  (Factor  analyses  of  the  Norwegian  Air  Force  test 
battery  suggested  that  the  Instrument  Comprehension  measures  both  mechanical  and 
spatial  abilities.) 

•  Aviation  Information  tests  showed  about  the  same  level  of  validity  in  the  three
meta-analyses  in  which  they  were  included  -  about  .22  (uncorrected).

•  The  biographical  inventory  measurement  method  showed  a  relatively  high  mean 
validity  in  the  two  meta-analyses  in  which  it  was  included  (mean  uncorrected  r  =  .27 
and  .21,  respectively).  However,  sampling  error  explained  very  little  of  the  variability 
in  validity  estimates  across  studies,  suggesting  that  there  are  other  factors  that  impact 
the  validity  of  such  inventories.  One  of  the  most  important  factors  may  be  the  extent  to 
which  the  inventory  was  designed  to  measure  KSAOs  relevant  for  the  aviator  job. 

•  At  least  some  types  of  psychomotor  and  information  processing  tests  are  likely  to 
exhibit  a  reasonable  level  of  validity  in  almost  any  sample  or  setting,  as  evidenced  by 
the  Damos  (1993)  finding  of  mean  uncorrected  validities  of  .18  and  .23  for  single-task 
and  multiple-task  performance-based  measures,  respectively.  This  is  supported  by  the 
range  of  mean  validities  from  .10  to  .32  for  measures  of  dexterity,  reaction  time,  and 
perceptual  speed  in  Hunter  and  Burke  (1994)  and  by  the  Martinussen  (1996)  mean 
validity  of  .20  for  psychomotor/information  processing  tests. 

•  Measures  of  spatial  ability  showed  a  mean  uncorrected  validity  of  .19  in  the  Hunter
and  Burke  meta-analysis.  In  the  Norwegian  Air  Force  battery,  subtests  with  titles  that
appear  most  like  traditional  measures  of  spatial  ability  (Paper  Forming,  Rotating
Patterns,  and  Figure  Pattern)  showed  very  low  validity.  However,  two  other  subtests 
that  contain  a  spatial  ability  component,  Raven’s  Matrices  and  Instrument 
Comprehension,  showed  much  higher  validity  (mean  uncorrected  r  =  .16  and  .26,
respectively).  Stage  two  of  the  NAF  battery  also  includes  spatial  ability  tests,  and  they 
showed  very  low  validity,  but  this  could  be  due  to  the  extreme  restriction  of  range 
given  that  applicants  had  already  been  directly  screened  on  spatial  abilities  during  the 
first  stage  of  the  testing  process. 

•  Personality  measures,  in  general,  showed  low  validity  for  predicting  performance  in 
training.  However,  as  noted  above,  the  meta-analytic  reviews  did  not  calculate  the 
mean  validity  for  different  types  of  personality  traits,  and  averaging  across  scales  more 
and  less  relevant  for  the  aviator  job  likely  obscured  the  true  level  of  validity  that  such 
measures  can  achieve. 


Personality  Research  in  the  Aviator  Selection  Arena 

As  noted  earlier  in  this  report,  a  great  deal  of  research  has  been  conducted  in  the  area  of 
personality  measurement  for  use  in  aviator  selection,  with  contradictory  results.  Lambirth, 
Dolgin,  Rentmeister-Bryant,  and  Moore  (2003)  commented,  “The  US  Navy,  Air  Force,  and 
Army  have  investigated  a  variety  of  personality  tests  for  use  in  pilot  selection  batteries.  These 
efforts  have  had  little  impact  on  the  selection  of  pilots  or  other  aircrew  because  of  response  bias 
and  the  inappropriateness  of  the  clinical  measures  selected  for  a  homogeneous,  non-clinical 
population.  However,  personality  tests  that  emphasize  positive  attributes,  rather  than 
psychopathology,  and  performance-based  personality  measures,  have  proven  to  be  more  accurate 
descriptors  of  personality  and  predictors  of  performance”  (p.  416). 

Job  analyses  and  other  studies  suggest  that  non-cognitive  characteristics  are  important  for 
aviator  performance.  Musson,  Sandal,  and  Helmreich  (2004)  say,  “Superior  performance  [among 
pilots]  has  consistently  been  linked  to  a  personality  profile  characterized  by  a  combination  of 
high  levels  of  instrumentality  and  expressivity  along  with  lower  levels  of  interpersonal 
aggressiveness.  This  personality  profile  has  sometimes  been  referred  to  as  the  ‘Right  Stuff,’ 
suggesting  this  is  the  ideal  description  of  an  astronaut  or  pilot.  Inferior  performance  has  been 
linked  to  personality  profiles  typified  by  a  hostile  and  competitive  interpersonal  orientation  . . . 
(the  ‘Wrong  Stuff’)  ...  or  to  low  achievement  motivation  combined  with  passive-aggressive
characteristics  (‘No  Stuff’)”  (p.  342).  The  authors  point  out  that  these  profiles  seem  to  be
especially  important  in  terms  of  working  as  part  of  a  crew. 

As  noted  above,  several  of  the  obstacles  to  conducting  good  research  in  the  aviator 
selection  domain  are  particularly  problematic  for  personality  measures.  These  include  the 
reliance  on  training  outcomes  as  the  criterion  measure  and  summarizing  results  by  averaging 
validity  estimates  across  several  different  personality  scales. 

Findings  from  General  Selection  Research  Literature 

There  is  a  great  deal  of  information  about  the  validity  of  measurement  methods  in  the 
general  selection  research  literature.  Cognitive  ability  tests  have  been  shown  to  predict  job 
performance,  particularly  technical  or  “can-do”  aspects  of  job  performance,  in  a  wide  variety  of 
jobs  (Hunter  &  Hunter,  1984;  McHenry,  Hough,  Toquam,  Hanson,  &  Ashworth,  1990;  Schmidt
&  Hunter,  1998;  Vernon,  1969).  Schmidt  and  Hunter  (1998)  collected  meta-analytic  evidence 
from  a  large  number  of  sources  and  summarized  it  according  to  different  types  of  personnel 
measures,  that  is,  measurement  methods,  for  predicting  performance  in  training  programs  and  for 
predicting  overall  job  performance.  Personnel  measures  with  the  highest  validity  for  predicting 
performance  in  job  training  programs  include  general  mental  ability  (GMA)  tests  (mean  r
=  .56),³  integrity  tests  (mean  r  =  .38),  peer  ratings  (mean  r  =  .36),  employment  interviews
(structured  and  unstructured)  (mean  r  =  .35),  conscientiousness  tests  (mean  r  =  .30),  and 
biographical  data  (mean  r  =  .30).  Personnel  measures  with  the  highest  levels  of  validity  for 
predicting  overall  job  performance  include  work  sample  tests  (mean  r  =  .54),  GMA  tests  (mean 
r  =  .51),  structured  employment  interviews  (mean  r  =  .51),  peer  ratings  (mean  r  =  .49),  job
knowledge  tests  (mean  r  =  .48),  training  and  education  ratings  (mean  r  =  .45),  job  tryout 
procedures  (mean  r  =  .44),  and  integrity  tests  (mean  r  =  .41).  As  noted  previously,  there  is  a 
high  degree  of  correspondence  between  method  and  target  of  measurement  for  some  of  the 
personnel  measures,  for  example  GMA  tests,  but  a  low  degree  of  correspondence  for  other 
personnel  measures,  for  example  employment  interviews. 

In  many  jobs,  technical  aspects  of  performance  are  not  the  only  aspects  that  matter  to  the 
organization.  For  example,  there  is  a  great  deal  of  interest  in  predicting  organizational 
citizenship  (Organ,  1994;  Organ  &  Ryan,  1995),  contextual  performance  (Borman  &  Motowidlo, 
1993),  and  “will-do”  aspects  of  job  performance  (Campbell,  Hanson,  &  Oppler,  2001).  To 
identify  attributes  that  underlie  these  non-technical  aspects  of  job  performance,  personnel 
selection  researchers  turned  to  the  vast  literature  on  non-cognitive  attributes.  Research  in  the 
personality,  biodata,  and  vocational  interest  domains  has  clearly  shown  that  measures  of  non- 
cognitive  attributes  can  predict  job  performance  (Barrick  &  Mount,  1991;  Gellatly,  Paunonen, 
Meyer,  Jackson,  &  Goffin,  1991;  Hunter  &  Hunter,  1984;  McHenry  et  al.,  1990;  Ones,
Viswesvaran,  &  Schmidt,  1993;  Tett,  Jackson,  &  Rothstein,  1991),  particularly  when  a  careful 
effort  is  made  to  identify  and  measure  attributes  that  one  would  expect  to  underlie  different 
criterion  constructs,  and  when  the  presence  or  importance  of  those  criterion  constructs  is 
considered  for  different  types  of  jobs  (e.g.,  Hough,  1992;  Hough  &  Ones,  2002;  Hurtz  & 
Donovan,  2000;  Mount,  Barrick,  &  Stewart,  1998;  Ones  &  Viswesvaran,  2001a,  2001b;  Reilly  & 
Chao,  1982;  Robertson  &  Kinder,  1993).  Several  researchers  have  meta-analyzed  validity  for 
personality  measures,  using  the  “Big  5”  model  (Norman,  1963;  Tupes  &  Christal,  1961)  or  some 
other  model  (e.g.,  Hogan,  1991;  Hough,  Eaton,  Dunnette,  Kamp,  &  McCloy,  1990)  as  an 
organizing  structure.  One  well-established  finding  is  that  measures  of  conscientiousness  appear 
to  be  a  valid  predictor  of  job  performance  in  virtually  all  jobs  (Barrick  &  Mount,  1991;
Schmidt  &  Hunter,  1998).  The  validity  of  other  personality  characteristics  seems  to  depend,  to  a
greater  extent,  on  the  type  of  job.  For  example,  extraversion  appears  to  be  more  valid  for 
predicting  performance  in  sales  and  managerial  jobs  than  in  other  types  of  jobs  (Barrick  & 
Mount,  1991;  Hough,  1992). 

Finally,  there  is  evidence  that  vocational  interests,  for  example  interest  in  becoming  an 
aviator,  can  be  a  valid  predictor  of  relevant  job  outcomes.  It  is  generally  assumed  that  interest  in
a  particular  occupation  will  lead  a  person  to  be  motivated  to  pursue  that  occupation  and
motivated  to  gain  knowledge  about  it.  In  aviator  selection  research,  there  typically  has  not  been
a  clear  distinction  between  measures  of  interests  and  measures  of  knowledge  or  background
experience,  so  it  is  not  possible  to  estimate  the  likely  validity  of  a  stand-alone  self-report
measure  of  interest  in  aviation.  There  is,  however,  evidence  from  the  US  Army’s  Project  A  that
scores  on  a  self-report  vocational  interest  inventory  are  valid  for  predicting  technical  job
performance  in  a  variety  of  Army  enlisted  military  occupations  (McHenry  et  al.,  1990;  Oppler,
McCloy,  Peterson,  Russell,  &  Campbell,  2001).

³  In  the  Schmidt  and  Hunter  (1998)  meta-analysis,  correlations  were  corrected  for  criterion  unreliability  and  range
restriction  (if  present).

Incremental  Predictive  Validity 

It  is  clear  from  the  information  described  in  the  preceding  section  that  an  aviator  selection 
battery  focusing  on  cognitive  ability  is  likely  to  be  a  valid  predictor  of  performance  in  training 
and  on  the  job.  So,  is  there  anything  to  be  gained  by  including  measures  of  other  KSAOs  in  the 
aviator  selection  process?  Research  suggests  that  there  is.  In  addition,  given  the  enormous  cost 
of  aviator  training,  even  a  small  increase  in  validity  can  offer  significant  utility  to  the  US  Army. 

Aviator  Selection  Research  Literature 

Most  of  the  research  on  this  topic  in  the  aviator  selection  arena  was  conducted  by  the 
USAF,  and  is  based  on  adding  the  Basic  Attributes  Test  (BAT)  to  the  AFOQT.  The  BAT  consists
of  several  computer-administered  tests  measuring  psychomotor  skills,  short-term  memory,
time-sharing  ability,  and  attitudes  toward  risk-taking.  Across  several  studies,  the  BAT  demonstrated
increases  in  the  amount  of  variance  accounted  for  (i.e.,  R²)  ranging  from  zero  to  .08  (e.g.,
Carretta,  1987b,  1988;  Carretta  &  Ree,  1996a),  with  higher  incremental  validity  for  criterion 
measures  other  than  training  pass-fail  (e.g.,  Advanced  Training  Recommendation  Board  [ATRB] 
ratings). 

Only  one  study  systematically  examined  the  incremental  validity  of  the  individual  BAT 
subtests  (Carretta  &  Ree,  1993).  This  study  found  that  a  BAT-Psychomotor  composite  score  and 
a  BAT-Risk  composite  score  each  (separately),  when  added  to  the  AFOQT-Aviator  composite
score,  increased  the  multiple  correlation  (R)  by  about  .04  for  predicting  pass/fail  and  class  rank.
Adding  the  BAT  measure  of  Flying  Experience  to  the  AFOQT-Aviator  composite  score
increased  R  by  about  .07  while  adding  BAT  Information  Processing  scores  did  not  increase  R
significantly.  Adding  all  BAT  scores  to  the  AFOQT-Aviator  composite  score  increased  R  by
approximately  .13  for  both  pass/fail  and  class  rank  (which  translates  into  an  increase  in  amount
of  variance  accounted  for,  R²,  by  about  .02).

In  research  that  did  not  involve  the  BAT,  Ree  (2004c)  found  that  two  dependent  variables 
derived  from  the  Test  of  Basic  Aviation  Skills  (TBAS)  increased  the  amount  of  variance 
accounted  for  in  basic  flight  training  performance  scores  by  .02  to  .03.  TBAS  includes  measures 
of  psychomotor  skills,  selective  attention,  spatial  ability,  and  noticing  and  responding  quickly 
and  appropriately  to  an  “emergency.”  The  report  does  not  specify  exactly  on  what  the  dependent 
variables  are  based,  but  they  appear  to  involve  psychomotor  and  spatial  aspects  of  test 
performance.  Retzlaff,  King,  and  Callister  (1995)  and  Carretta,  Retzlaff,  and  King  (1997)  report 
that  tests  of  aviation  interest/aptitude  included  in  the  AFOQT  have  been  shown  to  be  useful  for 
predicting  aviator  performance  beyond  measures  of  g  and  of  specific  cognitive  abilities  such  as 
verbal,  math,  spatial,  and  perceptual  speed. 
Blower  and  Dolgin  (1990)  used  a  hierarchical  regression  model  to  examine  the
incremental  validity  exhibited  by  three  tests:  (1)  Absolute  Difference-Horizontal  Tracking,  (2) 
Complex  Visual  Information  Processing,  and  (3)  Risk  Taking  in  predicting  success  in  primary 
flight  training  over  and  above  that  of  intelligence  and  demographic  variables.  Each  resulted  in 
approximately  a  3.5%  increase  in  variance  explained.  This  study  also  found  that  a
psychomotor/dichotic  listening  test,  a  Manikin  test  (a  mental  rotation  task),  and  a  Baddeley  test 
(an  assessment  of  working  memory)  did  not  add  incremental  validity. 
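
The  logic  of  a  hierarchical  regression  test  of  incremental  validity  can  be  sketched  with  simulated  data.  The  example  below  is  a  simplified  illustration,  not  Blower  and  Dolgin's  analysis:  it  uses  a  single  baseline  predictor  and  a  single  added  test,  all  variable  names  and  effect  sizes  are  hypothetical,  and  the  increase  in  R²  is  computed  as  the  squared  semipartial  correlation  of  the  added  test  (equivalent  to  the  R²  change  between  the  two  regression  steps).

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def residualize(target, predictor):
    """Residuals of `target` after a simple linear regression on `predictor`."""
    n = len(target)
    mean_t, mean_p = sum(target) / n, sum(predictor) / n
    slope = (sum((p - mean_p) * (t - mean_t) for p, t in zip(predictor, target))
             / sum((p - mean_p) ** 2 for p in predictor))
    return [(t - mean_t) - slope * (p - mean_p) for p, t in zip(predictor, target)]


def incremental_r2(criterion, baseline, added):
    """Return (R² for the baseline predictor alone, increase in R² from `added`)."""
    r2_step1 = pearson(criterion, baseline) ** 2
    added_unique = residualize(added, baseline)       # part of `added` not shared with baseline
    delta_r2 = pearson(criterion, added_unique) ** 2  # squared semipartial correlation
    return r2_step1, delta_r2


def simulate(n=5000, seed=1):
    """Hypothetical data: training outcome depends on ability and a tracking test."""
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n)]
    tracking = [0.3 * a + rng.gauss(0, 1) for a in ability]
    outcome = [0.4 * a + 0.2 * t + rng.gauss(0, 1) for a, t in zip(ability, tracking)]
    return outcome, ability, tracking


if __name__ == "__main__":
    outcome, ability, tracking = simulate()
    r2_base, delta = incremental_r2(outcome, ability, tracking)
    print(f"R² (ability only) = {r2_base:.3f}, increase from tracking test = {delta:.3f}")
```

With  more  than  one  baseline  predictor  (for  example,  intelligence  plus  demographic  variables),  the  same  comparison  would  be  made  by  fitting  two  nested  multiple  regression  models  and  subtracting  their  R²  values.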

Several  other  studies  used  a  regression  approach  to  examine  the  validity  of  various  types 
of  predictor  measures  for  predicting  undergraduate  training  performance  (Bartram  &  Dale,  1985; 
Carretta,  1989,  1990;  Morrison,  1988;  Olea  &  Ree,  1994),  but  did  not  report  incremental
validity.  However,  the  studies  did  report  that  psychomotor,  spatial  orientation,  biographical  data, 
working  memory,  and  to  a  lesser  degree  personality,  all  predicted  at  least  some  unique  variance 
in  undergraduate  aviator  training. 

General  Selection  Research  Literature 

In  the  general  selection  research  literature,  Schmidt  and  Hunter  (1998)  meta-analytically 
derived  an  estimate  of  the  incremental  validity  likely  to  occur  when  any  of  several  personnel 
measures  were  added  to  a  measure  of  general  mental  ability  for  predicting  a)  performance  in  a 
training  program  or  b)  overall  job  performance.  For  predicting  performance  in  a  training 
program,  their  results  show  that  the  greatest  incremental  validity  can  be  achieved  by 
supplementing  a  measure  of  general  mental  ability  with  an  integrity  test  or  a  conscientiousness 
test  (increase  in  multiple  R  of  .11  and  .09,  respectively).  For  predicting  overall  job  performance,
the  greatest  incremental  validity  can  be  achieved  by  supplementing  a  measure  of  general  mental 
ability  with  an  integrity  test  (increase  in  validity  of  .14),  a  conscientiousness  test  (increase  in 
validity  of  .12),  a  work  sample  test  (increase  in  validity  of  .12),  or  an  employment  interview 
(increase  in  validity  of  .09). 
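
The  arithmetic  behind  such  incremental  validity  estimates  can  be  sketched  with  the  standard  formula  for  the  multiple  correlation  of  two  predictors.  The  example  below  uses  the  job  performance  validities  cited  earlier  in  this  report  for  GMA  tests  (.51)  and  integrity  tests  (.41);  the  assumption  that  the  two  predictors  are  uncorrelated  with  each  other  is  an  illustrative  simplification,  and  the  function  name  is  hypothetical.

```python
from math import sqrt


def multiple_r(r1: float, r2: float, r12: float = 0.0) -> float:
    """Multiple correlation of a criterion with two predictors.

    r1, r2: validity of each predictor; r12: correlation between the predictors.
    """
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return sqrt(r_squared)


if __name__ == "__main__":
    GMA_VALIDITY = 0.51        # from the text (job performance)
    INTEGRITY_VALIDITY = 0.41  # from the text (job performance)
    for overlap in (0.0, 0.3):  # predictor intercorrelations are assumptions
        combined = multiple_r(GMA_VALIDITY, INTEGRITY_VALIDITY, overlap)
        print(f"r12 = {overlap:.1f}: R = {combined:.2f}, gain = {combined - GMA_VALIDITY:.2f}")
```

With  uncorrelated  predictors  the  gain  is  about  .14,  which  matches  the  value  reported  for  integrity  tests.  The  gain  shrinks  as  the  added  measure  overlaps  more  with  general  mental  ability,  which  is  why  non-cognitive  measures  tend  to  offer  the  largest  increments.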

Other  research  has  also  shown  that  measures  of  non-cognitive  attributes  can  provide 
incremental  validity  beyond  measures  of  cognitive  attributes.  This  is  especially  true  for 
predicting  non-technical  aspects  or  “will-do”  aspects  of  job  performance  (Day  &  Silverman, 
1989;  Mount,  Witt,  &  Barrick,  2000;  Ones  &  Viswesvaran,  2001b;  Oppler,  et  al.,  2001; 
Robertson  &  Kinder,  1993;  Russell,  Mattson,  Devlin,  &  Atwater,  1990;  Salgado,  1998).
However,  incremental  validities  have  also  been  found  in  the  vocational  interest  domain,  even  for
cognitive  (or  “can-do”)  aspects  of  job  performance  (Gellatly  et  al.,  1991;  Hough,  Barge,  &
Kamp,  2001). 

Summary  of  Incremental  Validity  Evidence 

Based  on  the  research  evidence  described  above,  there  is  reason  to  believe  that  measures 
of  the  following  constructs  may  add  incremental  validity  beyond  that  achieved  by  a  battery  that 
reliably  and  accurately  measures  general  intelligence: 

•  Psychomotor  skills 

•  Working  memory 

•  Aviation  interest/knowledge 

•  Flying  experience  -  although  the  type  of  flying  experience  may  make  a  difference
(fixed-wing  versus  rotary-wing) 

•  Personality  (including  factors  such  as  conscientiousness  and  risk-taking) 


Group  Differences 

As  mentioned  previously,  the  Army  aviation  applicant  sample  has  historically  been 
homogeneous,  that  is,  relatively  young,  male,  and  Caucasian,  with  at  least  a  high  school  degree 
and  usually  with  some  post-high  school  education.  Some  applicants  already  have  or  are  working 
on  a  private  pilot’s  license,  but  very  few  are  already  certified  to  fly  rotary-wing  aircraft.  Some 
are  already  in  the  military,  while  others  come  from  the  civilian  population.  In  fact,  the  Army  is 
the  only  branch  of  the  US  military  that  allows  civilians  and  military  enlisted  personnel  to  apply 
for  slots  in  the  aviation  training  program.  (All  branches  allow  military  Commissioned  Officers 
to  apply  for  aviator  training.)  In  the  Army,  applicants  who  are  accepted,  but  who  are  not
Commissioned  Officers  at  the  time  of  application,  must  complete  Warrant  Officer  training  before
they  enter  aviator  training. 

In  the  future,  it  is  likely  that  the  applicant  population  will  become  more  diverse  in  terms 
of  race  and  gender  but,  barring  major  policy  changes,  will  likely  not  become  more  diverse  on  the 
other  characteristics  listed  above.  One  of  ARI’s  objectives  is  to  minimize,  to  the  extent  possible, 
adverse  impact  exhibited  by  the  new  aviator  selection  battery.  The  level  of  adverse  impact 
exhibited  by  a  test  battery  depends  on  various  factors,  including  the  selection  ratio,  the  general 
characteristics  of  the  applicant  population,  and  placement  of  the  pass-fail  cutoff  on  a  test  battery. 
While  the  adverse  impact  cannot  be  estimated  at  this  point,  the  research  that  might  help  to 
anticipate  how  race  and  gender  subgroups  are  likely  to  score  on  an  aviator  selection  test  battery 
can  be  examined. 
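To illustrate why adverse impact depends on both the subgroup difference and the placement of the cutoff, the sketch below computes the ratio of subgroup pass rates under simplifying assumptions that are not drawn from the reviewed studies: both subgroups are assumed to have normally distributed scores with equal standard deviations, and the function name and parameters are hypothetical.

```python
from statistics import NormalDist


def adverse_impact_ratio(d: float, cutoff_z: float) -> float:
    """Pass rate of the lower-scoring group divided by that of the higher-scoring group.

    d        : standardized mean score difference between the two groups
    cutoff_z : pass-fail cutoff, in standard-deviation units of the higher-scoring group
    Assumes both groups are normal with SD = 1 (higher group mean 0, lower group mean -d).
    """
    std_normal = NormalDist()
    pass_high = 1 - std_normal.cdf(cutoff_z)
    # For a group with mean -d, P(score > cutoff) = 1 - Phi(cutoff + d).
    pass_low = 1 - std_normal.cdf(cutoff_z + d)
    return pass_low / pass_high
```

Under these assumptions, holding d fixed and raising the cutoff (that is, lowering the selection ratio) reduces the ratio, which is one reason adverse impact cannot be estimated until the cutoff and applicant population are known.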

Cognitive  Ability  Tests 

Research  conducted  on  cognitive  ability  tests  using  military  and  civilian  samples  suggests 
there  will  be  mean  score  differences  on  most  cognitive  ability  tests  when  racial  groups  are 
compared, but that the tests will not be unfair to any racial subgroup (Campbell, 1996; Carretta
& Ree, 2000; Roth, Bevier, Bobko, Switzer, & Tyler, 2001; Russell, Reynolds, & Campbell,
1994;  Sackett,  Schmitt,  Ellingson,  &  Kabin,  2001;  Toquam,  Corpe,  &  Dunnette,  1989;  Wise, 
Welsh,  Grafton,  Foley,  Earles,  Sawin,  &  Divgi,  1992).  Research  suggests  that  there  will  be  a 
standardized mean score difference of 0.6-1.0 between African-Americans and Whites, with
Whites scoring higher on average. Other evidence suggests that the Hispanic-White
standardized  mean  score  difference  will  be  about  half  as  large  as  the  African  American-White 
subgroup  difference,  again  with  the  White  mean  being  higher.  Finally,  Asian  subgroups 
sometimes  earn  a  higher  mean  score  than  the  White  subgroup,  and  sometimes  earn  a  lower  mean 
score.  Many  different  interpretations  of  and  explanations  for  these  findings  have  been  offered 
(e.g.,  educational  differences,  subtle  or  overt  racism,  cultural  bias),  but  no  one  has  yet  found  a 
way  to  entirely  explain  or  eliminate  the  differences,  and  efforts  to  ameliorate  the  differences 
have met with limited success (Sackett, et al., 2001; Schmitt, Sackett, & Ellingson, 2002).
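The standardized mean score differences cited in this section express the gap between two subgroup means in standard deviation units. A minimal sketch of one common way to compute such a value (using a pooled standard deviation) is shown below; the reviewed studies may have used other standardizers, and the function name is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(group_a, group_b) -> float:
    """Difference between group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
```

A positive result indicates that the first group has the higher mean; a value of 1.0 means the two means are one pooled standard deviation apart.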

Research  suggests  that  there  are  no  gender  differences  in  general  cognitive  ability  (g),  but 
that gender differences will appear on specific types of cognitive ability tests. Specifically,
females tend to perform better than males on tests of verbal ability and more poorly than males
on tests of spatial, mathematical, and mechanical abilities (Geary, Saults, Liu, & Hoard, 2000;
Maccoby & Jacklin, 1974; Maitland, Intrieri, Schaie, & Willis, 2000; Weiss, Kemmler,
Deisenhammer, Fleischhacker, & Delazer, 2003; Wise, et al., 1992). Burke (1995)
meta-analyzed gender subgroup differences on aviator aptitude tests and reported findings consistent
with  those  from  the  general  literature.  As  with  the  race  subgroup  differences,  a  variety  of 
explanations  have  been  offered  for  these  findings,  for  example,  differences  in  socialization 
experiences,  but  no  one  has  fully  explained  or  eliminated  them  to  date. 

The  magnitude  of  gender  differences  on  spatial  ability  tests  appears  to  vary  considerably 
with  the  type  of  test  (Linn  &  Peterson,  1985),  with  the  largest  differences  occurring  on  tests  that 
involve  three-dimensional  spatial  rotation  and  the  smallest  differences  occurring  on  tests  that 
involve  spatial  visualization  (e.g.,  paper-folding  tests).  Boer  (1991)  reviewed  construct  validity 
evidence  for  a  variety  of  spatial  ability  tests  and  concluded  that  “the  most  important  aspects  of 
spatial  ability  are  the  identification  of  the  optimal  solution  strategy  and,  perhaps,  a  final  process 
called  evaluation  and  confirmation.  It  seems  that  the  actual  execution  of  the  solution  process, 
including mental rotation, is less important.” (p. 108).

The  US  Army’s  Project  A  included  several  different  spatial  ability  tests.  Factor  analyses 
suggested  that  all  the  tests  load  on  a  single  underlying  factor,  but  that  some  of  the  tests  produced 
much  larger  race  and  gender  subgroup  differences  than  others  (Russell  &  Peterson,  2001). 
Specifically,  a  spatial  abilities  test  called  Assembling  Objects  showed  smaller  gender  differences 
than  other  spatial  ability  tests  but  was  a  valid  predictor  of  behavior.  A  similar  pattern  of 
findings,  using  most  of  the  same  spatial  ability  tests  developed  during  Project  A,  occurred  in  a 
large-scale study focused on revising the ASVAB (Russell, Reynolds, & Campbell, 1994).
These researchers recommended adding the Assembling Objects subtest to the ASVAB, a
recommendation that has since been enacted.

Psychomotor  Tests 

Males  typically  score  considerably  higher  on  psychomotor  tests  than  females  (Burke, 
1995;  Carretta,  1997b;  McHenry  &  Rose,  1988;  Russell  &  Peterson,  2001)  and  the  standardized 
mean score difference is often larger than 1.0. There is much less reported evidence for race
subgroup  differences  on  psychomotor  tests  but  Russell  and  Peterson  (2001)  found  standardized 
mean  score  differences  ranging  from  .38  to  .87  between  African  American  and  White  enlisted 
personnel  on  Project  A  psychomotor  tests.  This  finding  may  be  at  least  partially  explained  by 
the  correlation  between  psychomotor  and  cognitive  abilities  (Carretta,  1997a;  Ree  &  Carretta, 
1992). 

Speeded  Information  Processing  Tests 

In Project A and in a joint-services project (Russell & Peterson, 2001; Russell, Reynolds,
&  Campbell,  1994),  there  were  small  to  no  race  or  gender  subgroup  differences  on  speeded 
measures  of  information  processing,  for  example,  reaction  time.  Interestingly,  males  tended  to 
perform  somewhat  better  than  females  on  measures  that  focus  only  on  perceptual  speed,  while 
the  reverse  was  true  for  measures  that  focused  on  both  speed  and  accuracy.  Both  Carretta 
(1997b)  and  Burke  (1995)  report  similar  findings  when  examining  gender  differences  in 
performance in samples of USAF and UK Royal Air Force aviator applicants respectively.

Personality and Temperament Measures

Research  suggests  that  personality  and  temperament  measures  typically  show  small  or  no 
racial  subgroup  differences  (Bobko,  Roth,  &  Potosky,  1999;  Hough,  1998;  Ones  &  Viswesvaran, 
1998;  Russell  &  Peterson,  2001;  Schmitt,  Rogers,  Chan,  Sheppard,  &  Jennings,  1997).  In 
contrast,  there  often  are  gender  differences  on  personality  and  temperament  inventories  and, 
prior  to  passage  of  the  Civil  Rights  Act  of  1991,  many  test  batteries  used  separate  within-group 
norms  for  scoring  and  reporting  purposes,  that  is,  separate  norms  for  males  and  females,  and  for 
persons  of  different  racial  backgrounds.  The  Civil  Rights  Act  of  1991  prohibits  adjusting  scores, 
using  different  cutoffs,  or  otherwise  altering  the  results  of  employment  related  tests  on  the  basis 
of  race,  color,  religion,  sex,  or  national  origin.  As  a  consequence,  the  use  of  within-group  norms 
has  essentially  disappeared  for  cognitive  ability  tests.  In  the  personality  measurement  arena,  it 
appears  that  within-group  norms  are  generally  accepted  when  the  results  will  be  used  for 
descriptive  or  diagnostic  purposes  (e.g.,  in  a  counseling  setting)  but  are  much  more  controversial 
if  the  results  will  be  used  for  employment  related  decisions  (see  Sackett  &  Wilk,  1994). 

While  research  suggests  there  are  practically  meaningful  subgroup  differences  between 
males  and  females  on  personality  or  temperament  inventories,  the  direction  and  size  of  the 
difference  depends  on  which  personality  or  temperament  characteristic  is  being  measured 
(Sackett  &  Wilk,  1994)  and,  even  then,  is  not  always  consistent.  Furthermore,  very  little 
research  has  been  conducted  to  determine  whether  or  not  these  differences  lead  to  differential 
prediction  of  job  performance.  Saad  and  Sackett  (2002)  analyzed  data  from  the  US  Army’s 
Project  A  and  found  some  evidence  that  personality  scores  over-predicted  female  performance, 
but  no  evidence  of  bias. 

Sackett  and  Wilk  (1994)  reviewed  male-female  effect  sizes  on  the  scales  of  several  well- 
known  personality  inventories.  The  results  are  difficult  to  summarize  across  inventories  because, 
when  the  inventories  were  developed,  there  was  no  common  set  of  scale  labels  and  no  agreed- 
upon  set  of  underlying  constructs.  Generally,  males  scored  higher  than  females  on  scales 
measuring  dominance,  independence,  aggression,  and  risk-taking,  while  females  scored  higher 
than  males  on  scales  measuring  nurturance,  agreeableness,  affiliation,  and  conscientiousness. 
Many  of  the  differences  were  not  large,  however. 

The  temperament  inventory  developed  as  part  of  Project  A  (the  Assessment  of 
Background  and  Life  Experience  -  ABLE)  was  developed  with  the  intention  of  using  the  same 
set  of  norms  for  both  males  and  females.  In  a  large  sample  of  enlisted  US  Army  Soldiers,  the 
Male-Female effect sizes ranged from .00 to .54 across the 11 ABLE content scales. With the
exception  of  the  Physical  Condition  scale,  all  of  the  Male-Female  effect  sizes  were  .25  or  lower. 
Female  Soldiers  scored  at  least  somewhat  higher,  on  average,  than  males  on  cooperativeness, 
conscientiousness,  non-delinquency,  traditional  values,  work  orientation,  internal  locus  of 
control,  and  energy  level.  Male  Soldiers  scored  at  least  somewhat  higher,  on  average,  than 
females  on  emotional  stability,  self-esteem,  dominance,  and  physical  condition  (Russell  & 
Peterson,  2001).  Finally,  Ones  and  Viswesvaran  (1998)  meta-analyzed  subgroup  differences  on 
overt integrity tests and found that females scored .16 standard deviations higher than males.

What Should the Army Measure?


Research  focused  specifically  on  aviator  selection,  as  well  as  general  research,  clearly 
suggests  that  cognitive  ability,  or  general  intelligence  (g),  will  be  an  important  predictor  of 
aviator  performance.  Researchers  debate  the  usefulness  of  identifying  more  specific  abilities 
within  the  cognitive  ability  domain  (e.g.,  Jensen,  1993;  Ree  &  Earles,  1992,  1993;  Schmidt  & 
Hunter,  1993;  Sternberg  &  Wagner,  1993),  but  many  personnel  selection  batteries  include 
measures  of  different  types  of  cognitive  ability,  including  some  combination  of: 

1. General Reasoning;

2.  Spatial  Ability; 

3.  Mechanical  Reasoning; 

4.  Quantitative  Ability; 

5.  Verbal  Ability; 

6.  Multiple-Task  Performance  (also  known  as  Timesharing  or  Divided  Attention);  and, 

7. Information Processing (e.g., perceptual speed and accuracy, working memory,
cognitive task prioritization).

Research also suggests that including measures of the following abilities and
characteristics is likely to enhance the validity of the overall selection process:

8.  Aviation  or  Helicopter  Knowledge 

9.  Interest  in  Aviation 

10. Flying Experience - although the type of flying experience may make a difference
(fixed-wing versus rotary-wing)

11. Normal-Range Personality Characteristics - Based on the aviator selection and
general research literature, traits that seem relevant for the aviator job
include:

a.  Conscientiousness/Integrity; 

b.  Achievement  Orientation; 

c.  Stress  Tolerance/Emotional  Stability; 

d.  Adaptability/Cognitive  Flexibility; 

e.  Interpersonal/Crew  Interaction  skills; 

f.  Risk  Tolerance; 

g.  Internal  Locus  of  Control;  and, 

h.  Dominance/Potency  (including  Self-Confidence/Self-Esteem). 

These,  then,  are  KSAOs  that  should  be  included,  at  a  minimum,  in  a  job  analysis  study. 
Some of them may be defined more narrowly than shown here, based on taxonomic work in the
field of individual differences. There are other areas that should be considered in the overall
Army  aviator  selection  process,  for  example,  screening  for  serious  medical  conditions, 
neurological  deficiencies,  or  psychological  disorders.  However,  these  measures  would  be 
designed  to  “select  out”  applicants  that  do  not  belong  in  Army  aviation  training,  while  the 
present  project  is  intended  to  identify  those  batteries  that  would  be  useful  in  “selecting  in”  the 
most  qualified  applicants. 

Review  of  Existing  Aviator  Selection  Test  Batteries 

Even using a focused approach to the literature review, more than 150 potentially
relevant  articles  were  identified.  Rather  than  rely  entirely  on  a  narrative  summary,  a  spreadsheet 
was  developed  to  summarize  standard  information  about  various  test  batteries  and  to  facilitate 
comparison  of  the  test  batteries  when  deriving  a  recommended  selection  strategy.  The  following 
questions  were  identified  as  potentially  having  some  bearing  on  testing  recommendations: 

1. What subtests, if any, are part of the battery?

2.  Who  uses  the  battery? 

3.  How  long  does  it  take  to  administer  the  battery? 

4.  Is  the  battery  already  computerized?  Web-enabled? 

5.  Does  the  battery  require  non-standard  equipment  (e.g.,  joystick,  timing  card)? 

6.  What  validity  evidence  is  available  for  the  battery? 

7.  What  are  the  key  studies  and  references  describing  validation  efforts? 

An  answer  for  each  question  was  provided  (when  possible)  for  a  number  of  current  or 
recently  available  test  batteries,  and  documented  in  the  aforementioned  spreadsheet.  The  results 
are  shown  in  Appendix  A. 
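The seven questions above amount to a fixed record layout for each test battery. The sketch below shows one way such a spreadsheet row might be represented so that unanswered questions are easy to identify; the class and field names are hypothetical and are not the column headers used in Appendix A.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class BatteryRecord:
    """One spreadsheet row; field names are hypothetical, not the Appendix A headers."""

    name: str
    subtests: List[str] = field(default_factory=list)        # Question 1
    users: List[str] = field(default_factory=list)           # Question 2
    admin_minutes: Optional[int] = None                      # Question 3
    computerized: Optional[bool] = None                      # Question 4
    web_enabled: Optional[bool] = None                       # Question 4
    nonstandard_equipment: Optional[bool] = None             # Question 5
    validity_evidence: str = ""                              # Question 6
    key_references: List[str] = field(default_factory=list)  # Question 7

    def unanswered(self) -> List[str]:
        """Return the names of fields for which no answer was recorded."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None or value == "" or value == []:
                missing.append(f.name)
        return missing


# Example row limited to details stated in this review for the Navy's ASTB.
astb = BatteryRecord(
    name="ASTB",
    users=["US Navy", "USMC"],
    computerized=True,
    web_enabled=True,
)
```

Leaving a field empty rather than guessing mirrors the "when possible" qualification above: `astb.unanswered()` lists the questions that still lack an answer for that battery.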

When  considering  potential  measures  of  non-cognitive  characteristics  such  as 
conscientiousness,  it  is  clear  that  no  single,  existing  inventory  measures  every  characteristic  that 
might  be  important  for  the  aviator  job.  Therefore,  a  second  spreadsheet  was  created,  shown  in 
Appendix  B,  to  summarize  information  available  for  several  inventories  that  have  been 
administered  in  an  aviator  selection  setting,  or  that  are  already  owned  by  the  US  military.  This  is 
not  intended  to  be  a  comprehensive  review  of  all  possible  personality  inventories.  There  are 
dozens  of  commercially-available  personality  and  biodata  inventories  that  could  be  used  to 
measure  characteristics  important  for  aviators,  but  none  that  are  specifically  designed  for  aviator 
selection  and  none  that  appear  to  be  significantly  more  comprehensive  or  likely  to  exhibit 
significantly  higher  validity  than  those  already  available  to  the  US  military. 


Selection  Strategy  Recommendations 

The  following  approach  was  used  to  develop  a  recommended  selection  strategy: 

1. Identify KSAOs important for the aviator job through the literature review and a job
analysis. In other words, take a construct-oriented approach to this effort.

2. Identify, to the extent possible, existing measures of those KSAOs with known
validity.

3.  Construct  a  set  of  recommendations  outlining  best-bet  choices  for  predictors  that 
measure  critical  KSAOs,  taking  into  account  what  is  known  about  expected  subgroup 
differences. 

As  noted  above,  one  of  the  primary  considerations  in  recommending  an  existing  test 
battery  is  whether  or  not  there  is  validity  evidence  to  support  its  use.  Based  on  the  literature 
review,  and  as  summarized  in  Appendix  A,  there  are  several  aviator  selection  batteries  that  have 
demonstrated a reasonable level of validity for predicting performance in Undergraduate Pilot
Training, which is typically operationalized as pass/fail, but sometimes also includes measures of
training grades or instructor aviator ratings. Almost no one has attempted to predict aviator
performance outside of training, and those who have tried have not had a great deal of success. This is likely due to issues
with  criterion  quality,  as  there  are  significant  obstacles  to  developing  good  measures  of  aviator 
performance  on  the  job. 

The  most  viable  candidates  for  replacing  the  AFAST  appear  to  be  test  batteries  developed 
by the US military, several of which are described below (see note 4). There are also some aviator selection
batteries  developed  by  foreign  military  services  or  commercial  organizations  with  demonstrated 
evidence  of  validity,  as  shown  in  Appendix  A.  The  latter  test  batteries  are  less  viable  than 
batteries  developed  by  the  US  military  because:  1)  there  is  no  evidence  that  they  are  any  more 
valid  than  test  batteries  developed  by  the  US  military,  and  2)  it  would  likely  be  difficult  and/or 
expensive  for  the  US  Army  to  gain  access  to  them. 

The  recommendations  resulting  from  this  review  are  presented  in  overview  form  in 
Appendix  C.  It  would  seem  to  be  an  efficient  use  of  Army  testing  resources  to  create  a  two-stage 
testing  process.  The  first  stage  would  include  measures  of  cognitive  abilities  such  as  spatial 
ability,  mechanical  reasoning,  verbal  ability,  numerical  reasoning,  and  perceptual  speed  and 
accuracy,  as  well  as  a  measure  that  would  attempt  to  tap  motivation  to  become  an  aviator,  such 
as  an  Information  subtest.  The  Army  may  be  able  to  take  advantage  of  the  fact  that  the  US  Navy 
has  a  web-enabled  aviator  selection  battery  that  currently  consists  of  a  reasonable  set  of  cognitive 
tests  and  an  Information  subtest  that  assesses  aviation  and  nautical  knowledge.  Including  a  non- 
cognitive  inventory  in  Stage  1  is  also  recommended.  Such  an  inventory  may  provide 
incremental  validity  beyond  the  cognitive  test  battery,  and  it  may  help  ameliorate  race  subgroup 
differences  on  the  cognitive  tests.  The  inventory  could  include  scales  from  several  different 
inventories  that  have  been  developed  by  the  US  military.  The  Stage  1  battery  could  be 
administered via the Internet on any standard desktop computer and would not require any
non-standard peripherals or hardware. Thus, it could be administered virtually anywhere that a
computer  is  available,  along  with  a  reliable  Internet  connection  and  a  test  control  officer. 


4  Re-using  any  of  the  existing  AFAST  subtests  was  not  considered  because  Army  researchers  believe  that  the 
content may have been compromised over the several years in which it has been used.

The second stage of the test battery would focus on psychomotor skills and multiple-task
performance.  These  types  of  tests  are  often  combined  and  labeled  “performance-based” 
measures. Alternatively, given practical considerations, this Stage 2 test battery might assist in the
classification  of  selected  Army  aviators  into  mission/aircraft  types. 
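The two-stage process can be sketched as a simple sequential decision rule. The sketch assumes, for illustration only, that Stage 2 is administered solely to applicants who pass Stage 1 and that each stage has a single cutoff; the review does not specify cutoffs, and the function name and outcome labels are hypothetical. It does not cover the alternative use of Stage 2 for classification.

```python
from typing import Optional


def two_stage_decision(
    stage1_score: float,
    stage1_cutoff: float,
    stage2_score: Optional[float] = None,
    stage2_cutoff: Optional[float] = None,
) -> str:
    """Return the applicant's status under an assumed two-hurdle selection rule."""
    # Stage 1: web-administered cognitive and non-cognitive measures.
    if stage1_score < stage1_cutoff:
        return "not selected at Stage 1"
    # Stage 2 (performance-based measures) is needed only for Stage 1 passers.
    if stage2_score is None or stage2_cutoff is None:
        return "refer to Stage 2"
    if stage2_score < stage2_cutoff:
        return "not selected at Stage 2"
    return "selected"
```

Under this assumption the efficiency gain comes from the ordering: the less expensive Stage 1 battery screens all applicants, and the equipment-dependent Stage 2 tests are given only to those who remain.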

Best  Bet  Predictor  Measures 


Stage  1:  Cognitive  Measures 

Aviator  Selection  Test  Battery  (ASTB).  The  ASTB  is  the  US  Navy’s  primary  aviator 
selection  instrument.  It  grew  out  of  the  Pensacola  1000  Pilot  Study,  which  examined  over  60 
psychological, psychomotor, and physical tests (North & Griffin, 1977). The current version of
the  ASTB  includes  subtests  measuring  Reading  Comprehension,  Mathematical  Ability, 
Mechanical  Comprehension,  Spatial  Apperception,  and  Aviation  and  Nautical  Interests.  Navy 
researchers  are  currently  building  an  adaptive  version  of  the  ASTB  and  anticipate  transitioning  to 
adaptive  testing  within  three  to  five  years. 

The  ASTB  subtests  are  used  to  create  several  composite  scores,  including  the  Academic 
Qualification  Rating  (AQR)  and  the  Pilot  Flight  Aptitude  Rating  (PFAR).  Validity  data  for 
FY98-FY04  are  summarized  or  reported  in  graphic  form  in  a  series  of  ASTB  Workshop  Briefing 
Slides  (Operational  Psychology  Department,  20  July  2004).  Navy  researchers  found  that  AQR 
scores  predict  performance  in  ground  school  (r  =  .46  for  USN  student  aviators  and  r  =  .39  for 
USMC  student  aviators)  while  PFAR  scores  predict  performance  in  primary  flight  school 
(Primary NSS) (r = .32 for USN student aviators and r = .21 for USMC student aviators). This
research also reported that validity for predicting attrition from aviator training was in the
high teens for student aviators in both the USN and the USMC, using AQR or PFAR scores.
Sample  sizes  were  not  provided  in  the  briefing  slides,  but  include  thousands  of  cases  (personal 
communication,  Captain  John  Schmidt,  Operational  Psychology  Department,  USN,  October  29, 
2004). 


The  US  Navy  has  developed  a  web-administration  system  for  the  ASTB,  called 
Automated  Pilot  Examination  (APEX).  The  APEX  system  is  being  widely  used  throughout  the 
Navy  and  is  expected  to  account  for  the  bulk  of  ASTB  administrations  by  the  end  of  FY  2005 
(personal  communication,  Captain  John  Schmidt,  USN,  October  29,  2004).  This  is  currently  the 
only  web-administered  aviator  selection  test  battery. 

The  ASTB  was  designed  to  select  Commissioned  Officers  who  will  enter  training  to 
become  a  Navy  or  USMC  aviator.  All  Commissioned  Officers,  by  definition,  have  completed  a 
four-year college degree. Therefore, to ensure that the ASTB is not too difficult for the Army
aviator  applicant  population  (which  includes  persons  with  less  than  a  four-year  college  degree), 
the  US  Navy  administered  the  ASTB  to  a  sample  of  incoming  Army  student  aviators.  In 
February  2005,  the  Operational  Psychology  Department  of  the  Naval  Operational  Medicine 
Institute  administered  a  paper-and-pencil  version  of  the  ASTB  to  73  student  aviators  at  Ft. 
Rucker,  AL.  The  Navy  scored  the  data  and  provided  summary  information  to  the  ARI  monitor 
and  PDRI  project  team.  There  was  a  reasonable  degree  of  variability  in  the  Army  scores,  and  no 
evidence  that  it  was  too  difficult  for  Army  student  aviators  (i.e.,  no  floor  effect).  The  Navy  also 
provided,  for  comparison  purposes,  ASTB  summary  data  for  Navy  personnel  with  varying  levels 
of education. (The ASTB is administered to Navy personnel for some purposes other than
aviator selection.) Overall, the Army sample scores were similar to those in a mixed-education
Navy  sample,  and  to  those  in  a  sample  of  Navy  personnel  who  had  at  least  a  bachelor’s  degree. 
The  Army  aviator  sample  included  in  this  effort  had  already  been  selected  via  the  AFAST,  and 
thus  does  not  represent  the  full  range  of  intelligence  in  the  Army  aviator  applicant  population. 
Nevertheless,  it  appears  that  the  ASTB  will  not  prove  to  be  too  difficult,  overall,  for  the  US 
Army  aviator  applicant  sample. 

Air  Force  Officer  Qualification  Test  (AFOQT).  The  US  Air  Force  developed  the  AFOQT 
in  the  early  1950s  as  a  tool  for  selecting  civilian  applicants  for  officer  precommissioning  training 
programs and for classifying commissionees into aircrew job specialties (Rogers, Roach, &
Short, 1986; Skinner & Ree, 1987). The Air Force has periodically revised the AFOQT to
update  items,  ensure  test  security,  and  improve  predictive  validity.  The  first  form  of  the  AFOQT 
was  implemented  in  1953,  Form  R  is  currently  in  use,  and  Form  S  is  scheduled  for 
implementation in the near future. Form R has 16 subtests and Form S has 11 subtests that tap
verbal,  quantitative,  spatial,  and  mechanical  aptitudes.  Form  S  also  includes  a  measure  of  non- 
cognitive  characteristics,  called  the  Self-Description  Inventory.  Scores  on  the  AFOQT  subtests 
are  used  to  form  five  distinct  but  partially  overlapping  composites:  Pilot,  Navigator-Technical, 
Academic  Aptitude,  Verbal,  and  Quantitative  (Sperl  &  Ree,  1990).  The  Pilot  and  Navigator- 
Technical  composites  are  used  for  classification  into  Undergraduate  Pilot  Training  (UPT)  and 
Undergraduate  Navigator  Training  (UNT),  respectively.  The  AFOQT  has  been  validated  for 
more  than  36  officer  jobs  as  a  predictor  of  technical  training  grades  (Arth,  1986;  Carretta  &  Ree, 
1998).  Carretta  and  Ree  (1994)  found  that  the  AFOQT  showed  a  multiple  correlation  of  .20  for 
predicting  rank  in  UPT,  and  Shore  and  Gould  (2003)  reported  a  multiple  correlation 
(uncorrected)  of  .34  with  UPT  final  grade  for  Form  S  of  the  AFOQT.  According  to  Carretta 
(2002),  “the  predictiveness  of  the  AFOQT  for  aviator  training  performance  comes  almost 
entirely from its measurement of g and aviation job knowledge.” (p. 1).

As  new  forms  of  the  AFOQT  have  been  constructed  in  recent  years,  key  features  of  the 
subtests  have  deliberately  been  held  constant  to  ensure  equivalent  measurement.  Thus,  the  more 
recent  versions  are  equivalent  in  terms  of  subtest  content,  subtest  length,  item  difficulty,  testing 
time,  and  stylistic  features.  Further,  about  one-half  of  the  items  in  each  form  are  taken  directly 
from  the  previous  form,  and  analyses  are  conducted  to  equate  the  new  form  to  the  old  (Glomb  & 
Earles,  1997).  None  of  the  AFOQT  forms,  to  date,  have  been  computerized,  but  the  USAF 
intends  to  develop  this  capability. 

Like  the  ASTB,  the  AFOQT  is  designed  primarily  for  use  with  a  Commissioned  Officer 
population.  Therefore,  USAF  provided  access  to,  and  permission  to  analyze,  a  normative 
database containing scores on the soon-to-be-implemented AFOQT Form S for Basic Military
Training (BMT) enlisted personnel likely to apply for the Airman Education and Commissioning
Program (n = 509), Air Force Reserve Officer Training Cadets (n = 679), and Officer Training
School cadets (n = 462). The analyses are described in Gould and Damos (2005). They
conclude,  “As  expected,  the  AFOQT  was  more  difficult  for  the  Air  Force  enlisted  personnel  than 
for  other  commissioning  source  applicants.  However,  the  subtest  and  composite  score 
distributions  are  sufficient  to  discriminate  well  between  enlisted  personnel  if  the  AFOQT  or 
similar  aptitude  test  is  used  for  [aviator]  selection”  (p.  1).  If  the  US  Army  chooses  to  implement 
the  AFOQT,  these  authors  recommend  that  Army-specific  norms  and  passing  score(s)  be 
established.

Cognitive Prioritization (Popcorn Test). The cognitive prioritization test follows a
format  originally  developed  by  NASA  researchers,  and  is  colloquially  known  as  a  “popcorn” 
test.  It  is  a  measure  of  cognitive  processing,  specifically  the  ability  to  prioritize  several  moving 
stimuli  that  appear  on  a  computer  screen.  No  operational  pilot  selection  test  battery  has  included 
a  popcorn  test,  although  other  test  batteries,  for  example,  Wombat©  (Aero  Innovation,  1998), 
likely  measure  the  same  or  a  similar  underlying  ability.  It  is  recommended  that  the  US  Army 
include  this  type  of  test  in  its  Army  aviator  selection  battery  because  this  ability  may  become 
increasingly important in the future, as the cognitive load associated with flying rotary-wing
aircraft  increases.  Scores  on  this  test  may  also  be  related  to  measures  of  situational  awareness. 

Perceptual  Speed  and  Accuracy.  One  possible  measure  of  perceptual  speed  and  accuracy 
is  the  Table  Reading  test  that  is  a  subtest  of  the  AFOQT.  This  test  has  been  in  use  for  aviator 
selection  since  1942.  It  continues  to  account  for  unique  variance  in  prediction  of  aviator 
performance,  and  is  part  of  the  AFOQT  Pilot  Composite  score.  A  commercial  version  of  the  test 
is  also  available. 

Alternatively,  it  would  be  possible  to  develop  a  new  measure  of  perceptual  speed  and 
accuracy,  using  stimulus  materials  that  are  face  valid  for  Army  aviators.  PDRI  has  developed 
many  different  measures  of  perceptual  speed  and  accuracy,  and  could  do  so  efficiently  in  the 
current  project. 

Stage 1: Non-Cognitive Measures

Test  of  Adaptable  Personality  (TAP).  The  TAP  was  developed  by  the  US  Army  for  use 
in  training  and  developing  Special  Forces  Soldiers  and  officers.  It  consists  of  biodata  items  that 
were  written  to  target  constructs  such  as  achievement  orientation,  fitness  motivation,  cognitive 
flexibility,  peer  leadership,  and  interpersonal  skills.  In  Special  Forces  samples,  the  achievement 
orientation,  fitness  motivation,  and  cognitive  flexibility  scales  have  proven  valid  for  predicting 
peer  and  supervisor  ratings  of  performance  (personal  communication,  R.  Kilcullen,  November, 
2005;  Kilcullen,  Goodwin,  Chen,  Wisecarver,  &  Sanders,  undated;  Kilcullen,  Mael,  Goodwin,  & 
Zazanis,  1999). 

Assessment  of  Individual  Motivation  (AIM).  The  AIM  is  a  forced-choice  non-cognitive 
inventory that measures several constructs potentially important for aviator selection. Researchers
at ARI developed it to measure most of the same constructs as
the Assessment of Background and Life Experiences (ABLE) developed during Project A. In
Project  A,  the  ABLE  was  predictive  of  volitional  aspects  of  performance  in  a  variety  of  military 
enlisted  jobs,  and  it  exhibited  incremental  validity  when  added  to  a  cognitive  test  battery  (Russell 
&  Peterson,  2001).  However,  the  ABLE  was  never  implemented  for  selection  purposes  due  to 
concerns  about  its  fakability  (White,  Young,  &  Rumsey,  2001). 

The  AIM  specifically  addresses  fakability  concerns  by  using  the  forced-choice 
methodology.  This  methodology  has  long  been  suggested  as  a  way  to  make  an  inventory 
resistant  to  faking,  and  there  is  evidence  to  support  this  claim,  some  of  it  specifically  based  on 
the  AIM  (Jackson,  Wroblewski,  &  Ashton,  2000;  White,  et  al.,  2001).  ARI  is  also  currently 
funding  efforts  to  explore  an  Item  Response  Theory  (IRT)-based  approach  to  administering  and 
scoring  the  AIM,  in  an  attempt  to  make  it  even  more  resistant  to  faking  (Stark,  Chernyshenko,  & 
Drasgow,  2003). 

To  date,  research  on  the  AIM  has  focused  primarily  on  predicting  attrition,  but  there  is 
some  evidence  that  it  predicts  job  performance  and  personal  discipline  among  correctional 
officers  in  military  prisons  as  well  as  success  in  explosive  ordnance  disposal  training  for 
military  personnel  (White,  et  al.,  2001).  Project  A  results  suggest  it  is  reasonable  to  believe  that 
the  AIM  will  predict  volitional  aspects  of  job  performance  for  the  Army  aviator  job,  because  it 
measures  characteristics  important  for  performing  that  job. 

The  AIM  is  currently  used  for  operational  recruit  screening  as  part  of  the  US  Army’s 
GED  Plus  program,  and  it  has  shown  promise  for  use  in  pre-enlistment  screening  of  Non  High 
School  Graduate  (NHSG)  recruits  (White,  Young,  Heggestad,  Stark,  Drasgow,  &  Piskator,  2004, 
2005).  It  is  also  being  evaluated  for  potential  use  in  screening  of  US  Army  recruiters  and  drill 
sergeants.  Researchers  have  also  developed  various  scoring  methods  in  an  effort  to  enhance  the 
validity  of  the  AIM  for  predicting  attrition,  including  empirical  scoring  procedures  (White, 
Young,  Heggestad,  Stark,  Drasgow,  &  Piskator,  2005),  an  IRT-based  scoring  approach 
(Chernyshenko,  Stark,  &  Drasgow,  2003),  and  a  decision  tree  approach  (Lee  &  Drasgow, 
2003). 


Self-Description  Inventory  Plus  (SDI+).  The  SDI  was  developed  by  the  USAF,  and  is 
currently  considered  an  experimental  subtest  within  Form  S  of  the  AFOQT.  It  was  originally 
developed  to  measure  the  Big  Five  personality  factors.  In  recent  years,  USAF  researchers  wrote 
two  additional  scales  to  measure  Team  Orientation  and  Commitment  to  Military  Service  (Service 
Orientation).  It  contains  220  items  (see  Christal,  et  al.,  1997).  According  to  USAF  researchers, 
the  value  of  the  SDI  is  in  generating  profiles  for  people  and  ultimately  profiles  for  organizations 
and  job  families  to  facilitate  person-job  match  and  strategic  force  development  (J.  Weissmuller, 
personal  communication,  February  28,  2005).  It  is  not  specifically  intended  for  personnel 
selection.  Nevertheless,  validity  data  are  currently  being  collected  in  a  broad  USAF  sample, 
including  some  aviators. 

Armstrong  Laboratory  Aviation  Personality  Scale  (ALAPS).  The  ALAPS  was  also 
developed  by  the  USAF  (Retzlaff,  King,  Callister,  Orme,  &  Marsh,  2002).  It  includes  five 
“personality”  scales  (confidence,  socialness,  aggressiveness,  orderliness,  and  negativity),  six 
“crew  interaction”  scales  (dogmatism,  deference,  team  orientation,  organization,  impulsivity,  and 
risk-taking),  and  four  “psychopathology”  scales  (affective  lability,  anxiety,  depression,  and 
alcohol  abuse).  A  large-scale  validation  study  is  currently  underway  by  the  USAF.  The  US 
Navy  is  also  planning  to  conduct  validation  research  on  this  inventory.  However,  further 
investigation  into  this  inventory  revealed  the  unfortunate  fact  that  the  items  and  scoring  key  have 
been  published  in  a  USAF  technical  report  that  is  available  to  members  of  the  general  public  who 
are  savvy  enough  to  locate  it.  Therefore,  it  would  be  unwise  for  the  US  Army  to  use  this 
inventory  for  selection. 

New  Non-cognitive  Scales.  Based  on  our  review  of  the  existing  inventories,  there  are 
several  non-cognitive  characteristics  that  may  be  predictive  of  aviator  performance  that  are  not 
measured  by  any  of  the  inventories  readily  accessible  to  the  US  Army.  Therefore,  it  may  be 
advisable  to  write  new  scales  targeting  these  characteristics,  emulating  the  style  and  format  of 
the  items  in  the  TAP. 

Stage  2:  Psychomotor  Skills  and  Multiple-Task  Performance  (Performance-Based  Measures) 

Test  of  Basic  Aviation  Skills  (TBAS).  This  test  battery  was  developed  by  the  USAF  as  a 
replacement  for  the  BAT.  It  includes  three  subtests  designed  to  measure  spatial  orientation 
(tracking  tasks)  and  multiple  task  performance  skills  (tracking  plus  directed  listening),  as  well  as 
the  ability  to  make  decisions  under  stress.  TBAS  is  scheduled  for  fielding  in  2006  and  the  US 
Navy  is  also  considering  adding  it  to  their  aviator  selection  process.  Ree  (2003)  analyzed  TBAS 
data  from  USAF  aviator  trainees  (n  =  531)  who  had  already  been  selected  on  other  measures  and 
found  that  the  spatial  orientation  and  decision-making  subtests  showed  low  but  significant 
correlations  with  various  training  criterion  measures.  The  multiple  correlation  for  predicting  a 
combined  training  performance  measure  (based  on  check  ride  scores,  instructor  ratings,  and  quiz 
scores,  among  other  things)  was  .33;  the  multiple  correlation  for  predicting  UPT  pass/fail  was 
.31.  It  does  not  appear  that  these  correlations  were  corrected  for  shrinkage,  but  the  author  notes 
that  they  were  downwardly  biased  due  to  a  high  degree  of  range  restriction  on  the  predictor 
measures.  There  are  some  concerns  about  the  stability  of  scores  on  the  decision-making  subtest. 
Ree  (2004b)  examined  90-day  and  180-day  test-retest  reliability  in  a  small  sample  (n  =  126)  of 
USAF  aviator  trainees.  Reliability  was  very  low  for  the  decision-making  subtest  (.15)  and 
acceptable  for  the  other  subtest  scores  (.56-.75).  Further  investigation  of  TBAS  with  USAF 
personnel  revealed  the  unfortunate  fact  that  no  documentation  regarding  the  computer 
programming  appears  to  exist,  nor  any  documentation  about  how  the  dependent  variables  are 
calculated.  For  this  reason,  it  is  not  recommended  that  the  TBAS  be  used  for  Army  aviator 
selection  unless  and  until  program  documentation  can  be  located. 
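
The two statistical caveats noted above (no shrinkage correction, and downward bias from range restriction) pull in opposite directions. The sketch below shows one common way each adjustment is computed: the familiar adjusted-R^2 (Wherry-type) formula for shrinkage, and Thorndike's Case II formula for direct range restriction on a single predictor. R = .33 and n = 531 come from the text; the number of predictors (k = 3) and the SD ratio (1.5) are assumptions made purely for illustration, and neither function is taken from Ree (2003).

```python
import math


def shrinkage_adjusted_r(r_multiple, n, k):
    """Multiple R adjusted for shrinkage via the adjusted-R^2 formula.

    n: sample size; k: number of predictors in the regression.
    """
    r2_adj = 1.0 - (1.0 - r_multiple**2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(r2_adj, 0.0))  # floor at zero for tiny samples


def correct_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case II correction for direct range restriction.

    sd_ratio: unrestricted SD divided by restricted SD of the predictor.
    """
    u = sd_ratio
    return r_restricted * u / math.sqrt(1.0 + r_restricted**2 * (u**2 - 1.0))


# R and n are from the text; k and sd_ratio are hypothetical.
r_shrunk = shrinkage_adjusted_r(0.33, n=531, k=3)
r_unrestricted = correct_range_restriction(0.33, sd_ratio=1.5)
print(f"Shrinkage-adjusted R: {r_shrunk:.3f}")
print(f"Range-restriction-corrected r: {r_unrestricted:.3f}")
```

With a sample this large and few predictors, the shrinkage adjustment is small, whereas the range restriction correction can be substantial if the selected trainees are much more homogeneous than the applicant pool.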

Wombat©.  The  Wombat©  (Aero  Innovation,  1998)  is  a  commercially-available, 
computerized  test  battery  that  involves  learning  and  operating  a  complex  system.  It  does  not 
involve  discrete  subtests,  but  rather  involves  continuous  performance  on  a  primary  tracking  task, 
with  secondary  performance  on  any  of  three  “bonus”  tasks.  The  bonus  tasks  are  worth  varying 
amounts  of  points  at  different  times.  All  of  the  measures  are  combined  to  create  a  total 
efficiency  score.  During  the  testing  period,  examinees  are  given  continuous  feedback  on  their 
performance  which  can  help  them  maximize  their  task  performance  strategies,  to  the  extent  that 
they  have  the  attentional  and  cognitive  capacity  to  do  so.  There  has  been  little  published  on  the 
validity  of  the  Wombat©,  but  two  studies  suggest  that  scores  are  correlated  with  academic 
performance  in  flight  school  and  flight  hours  (Cain,  2002;  Frey,  Thomas,  Walton,  &  Wheeler, 
2001).  The  Wombat©  has  been  used  extensively  for  aviator  selection  in  Canada,  but  has  not 
been  used  operationally  by  the  US  military,  as  the  advertised  pricing  is  prohibitive. 

New  Performance-Based  Measure.  If  neither  the  TBAS  nor  the  Wombat©  is  a  viable 
alternative,  it  is  recommended  that  the  US  Army  develop  its  own  performance-based  measure  of 
psychomotor  skills  and  multiple-task  performance.  This  recommendation  is  being  made  because 
there  does  not  appear  to  be  another  performance-based  measure  that:  1)  has  proven  validity;  2)  is 
programmed  in  a  modern  programming  language;  and,  3)  is  readily  available  and  free  to  the  US 
Army. 

The  new  test  battery  could  include  subtests  similar  to  psychomotor  tests  with  a  long 
history  and  proven  validity,  for  example,  the  Complex  Coordination  test  and  the  Rotary  Pursuit 
test,  but  programmed  in  a  modern  programming  language.  Multiple-task  performance  could  be 
assessed  by  combining  a  directed  listening  test,  or  some  other  secondary  task,  with  a 
psychomotor  task.  For  example,  it  might  be  possible  to  use  the  TBAS  as  a  model  for 
development  but  with  careful  documentation  of  the  programming  and  development  of  scoring 
variables. 

Conclusions 

This  report  presents  a  review  of  a  great  deal  of  research  in  the  aviator  selection  and 
general  personnel  selection  domains.  That  information  was  used  to  identify  KSAOs  that  should 
be  included  in  a  job  analysis  study  focusing  on  the  Army  aviator  job.  It  was  further  used  to 
develop  a  recommended  strategy  for  an  Army  aviator  selection  battery. 

Research  focused  specifically  on  aviator  selection,  as  well  as  general  personnel  selection 
research,  clearly  suggests  that  cognitive  ability,  or  general  intelligence  (g),  will  be  an  important 
predictor  of  aviator  performance.  More  specific  cognitive  abilities  that  may  be  of  importance 
include:  general  reasoning;  spatial  ability;  mechanical  reasoning;  quantitative  ability;  verbal 
ability;  multiple-task  performance  (also  known  as  timesharing  or  divided  attention);  and 
information  processing  (e.g.,  perceptual  speed  and  accuracy,  working  memory,  cognitive  task 
prioritization).  Research  also  suggests  that  measures  of  aviation  or  helicopter  knowledge, 
interest  in  aviation,  flying  experience,  and  normal-range  personality  characteristics  are  likely  to 
enhance  the  validity  of  the  overall  selection  process.  Non-cognitive  traits  that  seem  relevant  for 
the  aviator  job  include:  conscientiousness/integrity;  achievement  orientation;  stress 
tolerance/emotional  stability;  adaptability/cognitive  flexibility;  interpersonal/crew  interaction 
skills;  risk  tolerance;  internal  locus  of  control;  and,  dominance/potency  (including 
self-confidence/self-esteem). 

The  results  of  this  review,  then,  suggest  a  selection  strategy  for  Army  aviation  that 
includes  measures  of  cognitive  abilities  such  as  spatial  ability,  mechanical  reasoning,  verbal 
ability,  numerical  reasoning,  perceptual  speed  and  accuracy,  and  cognitive  prioritization,  as  well 
as  a  measure  that  would  attempt  to  tap  motivation  to  become  an  aviator.  In  addition,  incremental 
validity  may  be  achieved  by  including  non-cognitive  measures  such  as  the  TAP  and  AIM,  as 
well  as  other  normal-range  personality  inventories. 

From  a  practical  perspective,  the  test  battery  could  be  administered  via  the  Internet  on 
any  standard  desktop  computer  and  would  not  require  any  non-standard  peripherals  or  hardware. 
Thus,  it  could  be  administered  virtually  anywhere  that  a  computer  is  available,  along  with  a 
reliable  Internet  connection  and  a  test  control  officer.  In  fact,  the  Army  may  be  able  to  take 
advantage  of  the  fact  that  the  US  Navy  has  a  web-enabled  aviator  selection  battery  that  currently 
consists  of  a  reasonable  set  of  cognitive  tests. 

The  addition  of  measures  that  focus  on  psychomotor  skills  and  multiple-task 
performance,  often  labeled  “performance-based”  measures,  is  recommended.  However,  practical 
constraints  on  time  and  resources  might  suggest  that  these  tests  be  considered  as  candidates  for 
inclusion  in  an  aviator  tracking  battery,  to  assist  in  the  classification  of  selected  Army  aviators 
into  mission/aircraft  types.  This  recommended  “Stage  2”  in  the  selection/classification  process 
is,  in  fact,  the  next  scheduled  research  and  development  effort  for  the  ARI  Rotary-Wing  Aviation 
Research  Unit  at  Fort  Rucker. 

Mechanical  Comprehension  10  min 


Composite  Score/Test    Subtest    Length    Summary  of  Validity  Evidence    Comments    Key  Reference(s)  Describing  Test  or  Documenting  Validity 


PCMS  score.  N=676.  Correlation  at  this  time 
between  pass/fail  and  PCMS  was  0.34 
uncorrected  and  0.48  when  the  criterion 
variable  was  corrected  for 
dichotomization 
[BAT  cont’d]  Psychomotor.  By  1993,  psychomotor  DVs  were  combined  into  a  composite. 


Web-Enabled.  Does  not  require  non-standard 


Sensory  Motor  Apparatus    Traditional  RAF  test.  Probably  cannot  be  Web-Enabled.  Requires  non-standard 
Picture  completion 



Appendix  B 

Overview  of  Non-Cognitive  Inventories  that  may  be  Relevant  for  Aviator  Selection 
Name    Scales    Developments  Scoring  Approach    Been  Tried  for  Pilot  Selection?    Summary  of  Validity  Evidence    Key  Reference(s)  Describing  Test  or  Documenting  Validity    Comments 


differences  between  male  and  female  aviation  students  disappeared  by  the  third  year.  8  of 
the  36  facets  showed  significant  correlations  with  GPAs.    Pettitt  and  Dunlap  (1995) 




Competence  &  Dutifulness 
facet  scores  within  the 
Conscientiousness  factor.) 
there  should  be  no  charge  to  use  any  of  the  scales. 
We  can’t  use  the  ALAPS  without  modification 
because  the  items  and  scoring  were  published  in  a 
USAF  technical  report. 